BRIDGE/ROUTER ARCHITECTURE FOR HIGH PERFORMANCE SCALABLE NETWORKING

RELATED APPLICATION DATA

This application is a continuation-in-part of prior filed U.S. application Ser. No. 08/438,897, entitled NETWORK INTERMEDIATE SYSTEM WITH MESSAGE PASSING ARCHITECTURE, filed May 10, 1995, which is incorporated by reference as if fully set forth herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to high performance bridge/routers that supply transparent communication between a variety of types of network interfaces within a single chassis, integrating such local area network standards as Token Ring, Ethernet, FDDI, and ATM, and also supporting wide area links. More particularly, the present invention provides an internetworking device providing high performance, scalable internetworking.

2. Description of Related Art

A router is an internetworking device that chooses between multiple paths when sending data, particularly when the paths available span a multitude of types of local area and wide area interfaces. Routers are best used for (1) selecting the most efficient path between any two locations; (2) automatically re-routing around failures; (3) solving broadcast and security problems; and (4) establishing and administering organizational domains. One class of router, often called bridge/routers or Brouters, also implements switching functionality, such as transparent bridging and the like. One commercially available example of such system is known as NETBuilder II, provided by 3Com Corporation of Santa Clara, Calif.

Because bridge/routers are designed to interconnect a variety of networks, the volume of data flow through the router can be very high. The ability to move large amounts of data, according to a wide variety of networking protocols, makes the bridge/router a unique class of high performance data processing engines.

One problem with prior art bridge/router architectures is scalability, and another is backward compatibility. When a customer buys a prior art system, and fills up the available ports on the system, often the customer is required to buy another copy of the entire system, which may be much more than is necessary, or scrap the old system and buy a new system with a larger number of ports. Thus, the prior art establishes plateaus in system hardware that are very expensive to cross.

The NETBuilder II architecture, which is described in the parent application from which this is a continuation in part, allows expansion on a port by port basis. However, there is a limit to the number of ports that can be mounted on the backplane bus of the NETBuilder II architecture, because this architecture requires that all of the data frames incoming through the ports get transferred across the backplane bus to a centrally shared memory and processed there.

An alternative prior art system allows for a number of sophisticated multi-port router engines to communicate with one another. Thus, the only packets that go across the link between the routers are those which must be transferred from a port on one router to a port on another. This architecture, however, requires that expansion of the system be done by adding an entire new router engine. Thus, the system does not allow incremental expansion on a port by port basis.

Accordingly, it is desirable to provide a high performance, scalable networking strategy which allows flexible growth of a switching, routing engine. Using such strategy, expertise can be brought together quickly to deliver projects or [illegible] scalable platform. Further, the investment in current equipment and technologies is protected, while paving the way for future technologies.

SUMMARY OF THE INVENTION

The present invention provides a high performance, scalable networking bridge/router system which overcomes many of the problems discussed above. The bridge/router architecture according to the present invention is based on a message passing system which interconnects a plurality of input/output modules. The input/output modules vary in complexity from a simple network interface device having no switching or routing resources on board, to a fully functional bridge/router system. Also, in between these two extremes, input/output modules which support distributed protocol processing with differing levels of intelligence are included.

The bridge/router architecture according to one aspect of the invention includes a central internetworking engine, including a shared memory resource coupled to the high speed backplane bus. Depending on the level of sophistication supported on the input/output module, the central internetworking engine may perform all routing decisions for packets received on a particular port, or may support distributed protocol processing at an input/output module in which certain classes of packets are routed locally on the input/output module while others are forwarded to the central engine. The architecture can be characterized as having a number of components, including a physical layer communication system for transferring control messages and data packets across the backplane bus; and a logical layer interprocessor messaging system which operates over the physical layer across the bus supporting communication between intelligent input/output modules, and between such modules and the central internetworking engine. Distributed protocol modules are supported on intelligent input/output modules, which communicate using the logical interprocessor messaging system with the central internetworking resources, and with other input/output modules on the system to make routing decisions for a majority of packets of a particular type received on such systems. As mentioned above, the central internetworking engine also supports input/output modules which only include the network interface chip and resources for communicating data across the backplane bus to the central engine, which acts as a data link layer agent for such systems. The logical layer can overlay a variety of physical layers, including in addition to a high speed bus, local area networks such as Ethernet, Token Ring, asynchronous transfer mode ATM, and others.

The centralized internetworking engine includes central distributed protocol module servers which manage the distributed protocol modules on the input/output modules, when the distributed protocol modules only partially support such protocols. Further, the centralized internetworking engine can support maintenance of synchronization between distributed protocol modules in the system. Thus, distributed protocol modules may include protocol address caches to
support routing decisions locally on the input/output module for addresses stored in the cache. The central internetworking engine according to this aspect stores the entire routing table for the particular protocol, and includes resources for responding to cache update requests from the input/output modules, and for managing routing data used in the operation of the distributed protocol modules on a plurality of input/output modules in the system. In a preferred system, the cache is loosely coupled with the central routing table, using a cache management protocol which is timer based and requests updates for stale entries in response to traffic using the stale entry.

Accordingly, the present invention can be characterized as an apparatus for interconnecting a plurality of networks. The apparatus comprises a communication medium having a physical layer protocol. A central routing processor is coupled to the physical layer. A plurality of input/output modules communicate with the central routing processor according to the physical layer protocol. The input/output modules have respective sets of physical network interfaces which support a variety of LAN and WAN network protocols. An interprocessor messaging system in a logical layer above the physical layer protocol is executed in the central routing processor and in a set of one or more intelligent input/output modules within the plurality of input/output modules. Distributed protocol services are executed over the interprocessor messaging system, and include a distributed protocol module in at least one of the plurality of input/output modules which makes routing decisions supported by the distributed protocol [illegible] processor which, in response to queries from the distributed protocol module, makes routing [illegible] module. Further, a particular input/output module in the plurality may include resources for signalling the central routing processor about events across the physical layer protocol. According to this aspect, there are centralized routing resources executed in the central routing processor over the physical layer protocol in response to such events for making routing decisions on behalf of such input/output modules.

The distributed protocol services according to the invention may [illegible] routing or switch decisions based on other internetworking protocols. The distributed protocol modules may include a protocol routing table cache and the distributed protocol module server includes resources for maintaining a central protocol routing table and supporting the protocol routing table caches.

The interprocessor messaging system includes resources for transferring control messages and network packets in transit among the central routing [illegible] output modules [illegible] access to the centralized internetworking engine. This enables tremendous flexibility in the design and expansion of bridge/routers according to the present invention. Further, because the centralized internetworking engine supports [illegible] input/output modules without higher layer protocol processing, backward compatibility is ensured, as well as the ability to incrementally expand an existing system with one network interface at a time growth.

The system further provides tremendous flexibility in utilization of the backplane. Thus, the interprocessor messaging system, as well as the physical layer protocol support communication among the central routing processor and the plurality of input/output modules with messages in a plurality of latency classes, and in a plurality of reliability classes. Thus, certain control messages can be delivered across a system with very high reliability. Data packets in transit can be transferred across the system with lower reliability but higher throughput. A dropped data packet from time to time does not affect overall system performance significantly because network protocols are able to recover from such lost packets. The critical parameter for transferring data in transit is minimizing system overhead and maximizing throughput.

The present invention provides a high performance scalable internetworking platform based on an interprocessor messaging system which provides a logical layer for communicating among input/output modules, allowing diverse input/output modules. The diverse input/output modules may include distributed protocol modules which communicate with a distributed protocol module server on a centralized resource. This allows a large amount of routing decisions to be [illegible] without requiring [illegible] the centralized processor. Rather, these packets are transferred directly from port to port in the system, maximizing the efficiency of the paths in the device. The centralized internetworking engine allows synchronization of internetworking functions, and provides for handling of exception packets [illegible] which occur rarely and need not be supported in the distributed modules.

Other aspects and advantages of the present invention can be seen upon review of the figures, the detailed description, and the claims which follow.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 provides a system block diagram for a scalable network intermediate system according to the present invention.

FIG. 2 provides a block diagram of a basic input/output module (IOM) which may be used in the system of FIG. 1.

FIG. 3 provides a block diagram of a semi-intelligent I/O module (IOS) such as used in the system of FIG. 1.

FIG. 4 provides a block diagram of an input/output module with an enhanced function processor (IOP) such as used in the system of FIG. 1.

FIG. 5 provides a block diagram of the central internetworking processor (COX) used for providing a shared resource for the other processors coupled to the buses in the system of FIG. 1.

FIG. 6 is a heuristic diagram providing an example of transmission for the system of FIG. 1.

FIG. 7 illustrates message flow scenarios for a system such as that shown in FIG. 1.

FIG. 8 provides a diagram of the dual queue structure and [illegible] forth the data transfer types in the system of FIG. 1.

FIG. 10 illustrates the data alignment and packing for messages transferred on the bus.

FIG. 11 illustrates the receive data structure layout for the bus interfaces.

FIG. 12 provides a receive queue example for the bus interface according to the present invention.

FIG. 13 illustrates the data flow from the high and normal priority command lists to the high and normal priority receive lists according to the present invention.



04/30/2004, EAST Version: 1.4.1 



5,802,278 


FIG. 14 illustrates the message transmit data path in the bus interface.

FIG. 15 illustrates the message transmit address path in the bus interface.

FIG. 16 illustrates the message receive logic in the message passing controller.

FIG. 17 illustrates the command list data transfer logic within the message passing controller.

FIG. 18 illustrates the free list structure and its associated registers in the MPC and the free buffers in the SDRAM.

FIG. 19 illustrates the command list bit definition for a message type transfer.

FIG. 20 illustrates the command list bit definition for a non-message type transfer.

FIG. 21 illustrates the message address field for a message transferred on the bus.

FIG. 22 is an overview of the major components in the centralized internetworking engine and the intelligent input/output modules according to the present invention.

FIG. 23 provides an overview of the interprocessor communication components for use in the system described in FIG. 22.

FIG. 24 provides a perspective of the data paths in the intelligent input/output modules for the system of FIG. 22.

FIG. 25 is a table showing the interprocessor messaging system message types and their priorities according to one implementation of the present invention.

FIGS. 26 through 29 illustrate the message formats supported by the interprocessor messaging system according to one embodiment of the present invention.

FIG. 30 shows the functional operation for the interprocessor messaging system [illegible] type message transfers.

FIG. 31 shows the interprocessor messaging system logical layer processing for data transfers from the central engine to the input/output modules.

FIG. 32 shows the interprocessor messaging system logical layer processing for data transfers from an input/output module to central internetworking engine.

FIG. 33 illustrates the components of distributed internet protocol (IP) processing according to the present invention.

FIG. 34 illustrates the components of distributed transparent bridging processing according to the present invention.

FIG. 35 illustrates application of the scalable architecture across a LAN or WAN backbone.

DETAILED DESCRIPTION

A detailed description of an embodiment of the present invention is provided with reference to the figures. FIGS. 1 through 5 illustrate a basic hardware environment for the system applied as a network intermediate system. FIGS. 6 through 21 illustrate the message transfer hardware and techniques applied according to the present invention in the environment of FIG. 1.

FIGS. 22 through 32 illustrate the processing resources and the logical layered interprocessor messaging system used over the physical layer.

FIG. 33 shows an internet protocol (IP) distributed protocol module and distributed protocol module server; and FIG. 34 shows distributed protocol resources for transparent bridging according to the present invention. FIG. 35 shows use of the scalable architecture with a LAN or WAN backbone.

I. SYSTEM DESCRIPTION

FIG. 1 provides a board level block diagram of a scalable bridge/router illustrating the present invention. The bridge/router includes a central control card COX 10 coupled to a first high speed parallel bus 11 and a second high speed parallel bus 12. A plurality of input/output (I/O) modules are coupled to the bus 11 to provide input/output functions for connected networks. The plurality of I/O modules includes [illegible] IOM 13 and 14, a semi-intelligent processing device IOS 15 and 16, and a more powerful processing system IOP 17, 18, 19, and 20. The IOP boxes 17-20 include interfaces to both high speed buses 11 and 12.

Each of the plurality of processors has at least one associated network connection. Thus, the IOM boxes 13 and 14 include two network connections each, which might be coupled to, for instance, Ethernet or token ring local area networks. The IOS boxes 15 and 16 include five connections each, coupling to local area networks (LANs), such as Ethernet, FDDI, token ring, or the like and/or wide area networks (WAN) links. The IOP boxes 17-20 have eight network connections each and handle much higher throughputs.

The basic IOM box 13 is illustrated in FIG. 2. It includes [illegible] coupled to a [illegible] MAC [illegible] chip is coupled to a bus interface chip 33 with associated configuration data 34, and through the interface chip 33 to a backplane bus connection 35. The IOM box shown in FIG. 2 relies primarily on the central control box COX 10 for the management of data transfer and control functions.

The bus interface chip 33 is described in detail in our co-pending U.S. patent application entitled INPUT/OUTPUT BUS ARCHITECTURE WITH PARALLEL ARBITRATION, application Ser. No. 08/033,008, filed Feb. 26, 1993, invented by Mark Isfeld, et al. Such application is incorporated by reference as if fully set forth herein to fully provide a detailed description of the bus architecture in the preferred system. However, this particular bus architecture is not intended to be limiting. The preferred system uses a 32 [illegible] 25 MHz clock, and preferably a 50 MHz clock, for a nominal data rate of 800 MBPS (megabits per second) or 1600 MBPS. Even higher data rates can be achieved with state of the art high speed parallel bus architecture, or other data transfer techniques. Also, the backplane may be implemented using a variety of local area network technologies as discussed below with reference to FIG. 35.

The semi-intelligent I/O processor IOS, 15 and 16, is illustrated in FIG. 3. As can be seen, [illegible] to the bus 11 through the bus interface chip 40. A non-volatile memory device 41, such as an EEPROM, stores configuration data and the like for the bus interface 40. A data interface to an intermediate bus 42 is provided through latches 43. Also, a local memory 44 and a DMA control module 45 are coupled to the intermediate bus 42 and the local memory 44. An intelligent microprocessor 46, such as the Am29030 manufactured by Advanced Micro Devices, Inc., is coupled to the intermediate bus 42. A flash programmable read only memory 47 provides storage for programs executed by the processor 46. A console port 48 is provided through a UART interface 49 to the bus 42. A plurality of network connections, generally 50, are coupled to the bus 42 through respective physical interfaces 51-1 through 51-N,




and medium access control MAC devices 52-1 through 52-N. The box may include status light emitting diodes 53 connected and controlled as desired by the particular user.

FIG. 4 illustrates the block diagram of the higher performance input/output processor IOP of FIG. 1. This system is coupled to the first bus 11 and the second bus 12 through respective bus connectors 60 and 61. The bus connectors 60 and 61 are coupled to message passing controller ASICs 62 and 63, respectively, which are, in turn, connected to an intermediate bus 64. The intermediate bus (also called internal bus herein) is coupled to a shared memory controller 65 which controls access to a shared memory resource 66. The intermediate bus 64 is coupled through a peripheral bus interface 67 to a network data bus 68. On the network data bus, there are a plurality of network connections, generally 69, made through respective MAC devices 70-1 through 70-N and physical interfaces 71-1 through 71-N. The shared memory controller 65 is also coupled to a control bus 72, which is connected to a high speed processor 73, flash programmable read only memory 74 storing programs, nonvolatile EEPROM memory 75 storing parameters and static code, and a console port 76 through a UART interface 77.

The central control box is illustrated in FIG. 5. This box is basically similar to the box of FIG. 4. Thus, the box includes a first bus connector 80 and a second bus connector 81 for the first and second buses, respectively. Message passing controllers 82 and 83 are coupled to the bus connectors 80 and 81, and to an intermediate bus 84. A peripheral bus transfer ASIC 85 is connected between the intermediate bus and a peripheral bus 86. An Ethernet controller 87, an Ethernet controller 88, and a wide area network (WAN) controller 89 are coupled to the peripheral bus 86 and to the respective networks through physical connections 90, 91, and 92.

The intermediate bus 84 is also connected to a shared memory controller 93, and through the shared memory controller 93 to a shared memory resource 94. A second shared memory resource may also be connected directly to the MPC ASIC 82 or 83. The shared memory controller 93 is also connected to a processor bus 95 which interconnects a processor 96, working memory 97 for the processor, flash memory 98 for processor code, EEPROM memory 99 for static code and parameters, a PCMCIA interface 100 for accepting flash memory cards for upgrade purposes and the like, a floppy disk controller 101 for driving a floppy disk, an SCSI interface for connection to a hard disk 102, an interface 103 for connection to a front panel providing a user interface, and a dual UART device 104 which provides for connection to a console 105 and a debug port 106. In addition, read only memory 107 may be connected to the processor bus 95. The native PCMCIA interface is provided for enabling a redundant reliable boot mechanism.

The software processing for a high performance router breaks fairly cleanly into two major pieces: the data forwarding functions and the control/management functions. The data forwarding functions include device drivers and link-layer protocols such as HDLC-LAPD in addition to the per-packet processing involved with recognizing, validating, updating, and routing packets between physical interfaces. The control and management software functions include routing protocols and network control protocols in addition to all configuration and management functions.

In general, the data forwarding functions are optimized for maximum performance with near real-time constraints, whereas the control and management functions simply run to completion on a time available basis, with some exceptions. When system performance is measured, it is primarily the forwarding capacity of the router in terms of bandwidth, packets-per-second, and fan-out that is considered, with an implicit assumption that the control and management functions will be sufficient. The control and management software comprises the vast majority of the code and can use large amounts of data space, but most of the data space consumed by these functions need not be shared with the forwarding software.

In the system of FIG. 1, the forwarding function is replicated in distributed protocol modules in the semi-intelligent and full function processors IOS and IOP, with distributed protocol module servers along with the full function routing and other centralized functions running on the single central processor COX. Thus, the forwarding functions where possible run on processors near the physical interfaces, and mechanisms, including hardware supported message passing, tie the distributed processing modules to each other and to the central control functions. This architecture allows some forwarding functions to be distributed, while others are centralized on the central control box. The message passing architecture enables significant flexibility in the management of the location of software in the router architecture. Further, backward compatibility and system scalability are preserved.

II. MESSAGE PASSING STRUCTURES AND PROCESSOR

The basic message passing technique is illustrated with respect to FIG. 6. In FIG. 6, the process of receiving a packet on interface 2 on card 4 is illustrated. Thus, the packet is received and proceeds along arrow 100 into a buffer 101 in the card. While it is in the buffer, the processor parses the packet, looks up the destination for the packet, and processes it according to the routing code. Next, a software header 102 is added to the packet. Then, the packet is added to a queue 103 for message transmission. The hardware 104 in the card sends the message in a fragmented state, which includes a first message packet 105 which has a start identifier, a channel identifier, and a destination slot identifier (in this case, slot 5, channel 3). The first packet includes the software header which identifies the destination interface as interface 3 in processor 5, the length of the packet, etc. Packet 105 includes the first part of the packet data. The next fragment of the message 106 includes a header indicating the destination slot and its channel as well as packet data. The packet 107 includes the destination and its channel, and an indicator that it is the last packet or "end" in the message. Finally, this last packet is filled with the balance of message data. These three fragments of the message are transferred across the high speed bus 108 to the destination slot 5. In slot 5, the hardware 109 receives the packet, reassembles it in the next free buffer 110, and queues the message to software in the queue 111. The software and hardware in the IOP at slot 5 transmit the packet out [illegible] in card 5 across the arrow [illegible].

This message passing protocol is a "push" paradigm, which has the effect of using the bus more like a LAN than a normal memory bus. This has several important features:

Receiver allocates/manages buffering independent of transmitter.

Single "address" used for all data sent in one message.

Bus addressing is per-card, port-level addressing in software header.

Bus used in write-only mode.

No shared memory usage.
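The fragmentation described for FIG. 6 (a start cell, intermediate cells, and an end cell, each tagged with a destination slot and channel and reassembled by the receiver in its own buffer) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented hardware logic: the `Cell` record, the `fragment` function, the `Receiver` class, the 64-byte cell size, and keying reassembly on the channel alone are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Cell:
    dest_slot: int   # hardware address: the slot that should receive the cell
    channel: int     # distinguishes messages that are in flight concurrently
    start: bool      # set on the first cell of a message
    end: bool        # set on the last cell of a message
    payload: bytes


def fragment(message: bytes, dest_slot: int, channel: int,
             cell_size: int = 64) -> List[Cell]:
    """Split one message into cells; cell_size is an assumed value."""
    chunks = [message[i:i + cell_size]
              for i in range(0, len(message), cell_size)] or [b""]
    last = len(chunks) - 1
    return [Cell(dest_slot, channel, i == 0, i == last, chunk)
            for i, chunk in enumerate(chunks)]


class Receiver:
    """Receiver-side reassembly: the receiver owns all buffering."""

    def __init__(self, slot: int) -> None:
        self.slot = slot
        self._partial: Dict[int, bytearray] = {}  # channel -> buffer in progress
        self.queue: List[bytes] = []              # completed messages for software

    def on_cell(self, cell: Cell) -> None:
        if cell.dest_slot != self.slot:
            return  # cell addressed to another card
        if cell.start:
            self._partial[cell.channel] = bytearray()
        buffer = self._partial.get(cell.channel)
        if buffer is None:
            return  # cell without a preceding start cell: drop it
        buffer.extend(cell.payload)
        if cell.end:
            self.queue.append(bytes(self._partial.pop(cell.channel)))
```

Because each cell carries its own slot and channel tags, cells from different messages can be interleaved on the bus and still be reassembled correctly, and the sender never needs to know where the receiver stores the data.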
Reliability not guaranteed (must be supplied at a higher level, if needed).

Messages are sent as a stream of cells, interleaved with other message transmissions.

The paradigm provides the following benefits:

Improved protection/robustness.

Reduced driver overhead.

Reduced complexity, per-destination queues not required, etc.

Improved bus utilization (about 3x previous).

Bus is not monopolized by one device during a message transmission.

Other slots can interleave cells on the bus, so they do not have to wait for a long message from another slot.

In FIG. 6, IOP4 receives a packet, and sends it to IOP5. Note that the input card simply sends the message to the output card. The sender does not need to allocate buffers or get permission from the receiver. A hardware address specifies the slot that should receive the message. A software message header specifies the message type (control, data, etc.), its actual length, output port number, etc. The output card is responsible for dropping messages if there is too much traffic.

FIG. 7 is an example of how messages will flow in the system of FIG. 1 in order to forward a packet. In this example, the path that a packet follows to a destination unknown by the receiver card IOP1 is shown.

Packet enters from network attached to IOP1 (transition 1). The local processor looks up the destination (whether it be bridged, or routed by various protocols), and finds it does not know what to do with this packet. It generates a high priority cache lookup request and sends it to the COX. The COX looks up the destination network in its database, and sends back the answer to IOP1 (3). IOP1 adds the destination to its cache, and finds the held packet. It then directly forwards it to IOP2 (4) as a message complete with instructions on what to do with the packet. IOP2 examines the message header and determines it should transmit the packet out port X (5). IOP2 DID NOT examine the actual packet in any way. It simply looked at a simple message header, and decoded the command to transmit the enclosed packet to port X.

If the packet originated from an IOM, then the IOM puts the packet in COX memory. The COX does the same functions as outlined above, for the IOM based packet. Packets destined for an IOM are sent to the COX which queues them for transmission. In other words, existing IOMs are just ports on the COX as far as the message passing paradigm goes.

Also notice that if IOP1 has the destination already stored in the local cache (normal case), then messages 2 and 3 are eliminated. In either case the packet data only travels across the bus once.

This system uses a layered architecture for communication between processors, with a common set of message passing services supporting both control and data paths. It utilizes the bus for the physical layer and either shared-memory DMA-based software or hardware-supported card-

Note that the system is designed to require only loose synchronization between processors. There are no critical real-time constraints on any control messages between processors that would cause the system to break if they were not met. All inter-processor control functions must tolerate lost messages. Some data loss will be acceptable. For instance, a route cache update or a port down message could be lost, as long as the system continues to run smoothly.

At the lowest layer above the actual data movement function is a dual-queue structure, as illustrated in FIG. 8, which supports these message classes according to their primary service requirements. These queues may be supported in software, in hardware, or in a combination of the two. One queue is designed to provide high reliability and low latency with relatively low throughput, and is used for the first two classes of messages: internal and network control messages. The second queue is optimized for high throughput and supports the majority of the data traffic.

Both control messages and data packets are encapsulated with a standard header which conveys the message type, destination addressing (output port, control interface, etc.), and other control information associated with the message. For internal control messages this additional information might include sequence numbers, event handles, etc., while data packets might have MAC encapsulation type, transmission priority, etc.

FIG. 8 illustrates the basic dual queue structure used in the messaging paths. In this structure, the card will include a plurality of physical interfaces, generally 150. Inbound data from the physical interfaces is placed in an inbound multiplexing packet processing queue 151, generally implemented by software. This packet processing queue does the basic data transport processes as described above. From this queue 151, the packets are transferred to a high throughput queue 152 implemented at either hardware or software. From the high throughput queue, packets are transferred out onto the bus transmission path 153. Alternatively, communications which must be reliable are passed through a reliable receive and transmit block 154 where they are tagged for preferential handling at the receive end, and manually passed to a high priority, low latency queue (HRQ 155) out through the bus transmit function 153. Similarly, data received from a bus receive path 156 is passed either through a high reliability queue 157 or a high throughput queue 158. The high reliability queue is passed to the reliable receive and transmit block 154 into the outbound demultiplexing packet processing queue 159. Alternatively, control and management functions 160 receive data through the reliable path. The outbound software queue 159 sends appropriate packets to the physical interfaces 150. There may also be a path between the inbound and outbound software queues 151 and 159.

As illustrated in the figure, preferably the lower level queues 152, 155, 157, and 158 are implemented in the hardware assisted environment while the higher level queues 151 and 159 are software executed by a local processor on the board. However, in the central processor unit, the lower level queues may be implemented in software which serves

to-card transmissions to provide required services for vari- the IOM blocks described above with respect to FIG. 2, and 

ous classes of messages. The three major classes of message 60 interface processors may be implemented in the particular 

arc: application with these queues in software. 

Internal control messages: low latency (<10 ms), high FIG. 9 provides a table of the various data transfers 

reliability, low throughput supported by the system of the preferred embodiment The 

Network control messages: medium latency (<250 ms), table indicates the transfer type across the top row, including 

high reliability, low throughput 65 a message transmit, a shared memory write, a shared 

Normal data packets: average (best effort) latency, aver- memory access read, a shared memory read, a memory 

age (best effort) reliability, high throughput move, a cell transmit, a message receive, a bus input/output 
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and bus memory write, a bus read, and a promiscuous 
receive transfer. The table summarizes the source of the 
source address, the source of the destination address, the 
direction of the transfer, the origin of the cycle, the receive 
activity, the data buffering, and the alignment and packing 5 
functions for each of the different transfers. 

Thus, the system includes a number of hardware and software system buffer structures and control and management modules. Generally, data fragments are gathered and byte-wise aligned to form cells which move across the bus. At the receiving end, cells may be placed into a receive buffer as allocated by the receiving processor.

The basic structures include a command list, a free list, and a receive list.

The command list is a managed string of four word entries through which software instructs hardware to perform certain data transfers, generally across the bus. The blocks of memory to be moved may be thought of as buffers, or as data fragments. There is no hardware requirement for these chunks of data to be aligned or sized in any specific way. Implicit in the source and destination address along with the command list entry's control field is the type of data transfer. The command list is built in synchronous dynamic RAM (SDRAM) and may be FIFOed (or cached) within the message passing controller hardware. Software writes entries into the command list, while hardware reads and executes those commands. The command list is managed via command head and command tail pointers.
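
As an illustration only, the head and tail management of the command list can be modeled as a ring of four word entries in which software advances the tail and hardware advances the head. The names used here (`Entry`, `CommandList`, `sw_write`, `hw_execute`) and the field encoding are assumptions made for this sketch and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One four-word command list entry (field layout is hypothetical)."""
    ctrl: int    # control field, e.g. first/middle/last fragment
    src: int     # source buffer address
    dst: int     # destination address (slot, channel)
    length: int  # data length in bytes


class CommandList:
    """Ring of entries: software produces at the tail, hardware consumes at the head."""

    def __init__(self, size):
        self.ring = [None] * size
        self.head = 0  # next entry hardware will execute
        self.tail = 0  # next slot software will fill

    def pending(self):
        """Number of entries written by software but not yet executed."""
        return (self.tail - self.head) % len(self.ring)

    def sw_write(self, entry):
        """Software side: append an entry, or return False if the ring is full."""
        nxt = (self.tail + 1) % len(self.ring)
        if nxt == self.head:
            return False  # one slot stays empty so full and empty differ
        self.ring[self.tail] = entry
        self.tail = nxt
        return True

    def hw_execute(self):
        """Hardware side: take the entry at the head, or None if the list is empty."""
        if self.head == self.tail:
            return None
        entry = self.ring[self.head]
        self.head = (self.head + 1) % len(self.ring)
        return entry
```

Keeping one slot unused is a common way to tell a full ring from an empty one when only two pointers are available; the actual hardware may distinguish the two cases differently.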

The free list is a series of single word entries pointing to available or "free" receive buffers which may be allocated by hardware for buffering inbound bus data. The free list is maintained in SDRAM and may be FIFOed or cached within the message passing controller hardware. Software places free receive buffers into the free list so that hardware may then allocate a free buffer to a given receive channel, as required by incoming data. Once the buffer is actually filled, hardware places the buffer pointer into one of two receive lists. Only software writes entries to the free list, and those entries are known to be valid by the contents of the software based free tail pointer. Hardware may read entries from the list, and the only indication of what has been read is the value of the hardware-owned free head pointer.

The receive list is a series of two word entries pointing to full receive buffers which need the attention of software. The list itself is SDRAM resident and the list entries point to receive buffers which also reside in SDRAM. In addition to the physical address of the filled buffer, the receive list entry includes a flag and count field.

FIG. 10 shows the data flow beginning with a command list and eventually showing up on a normal priority receive list.

As can be seen in FIG. 10, a command list 200 includes a sequence of four word entries. For example, the four entries 201, 202, 203, and 204 characterize a transfer from a network interface in one processor across the bus to a network interface in a different processor. The first entry is recognized as the beginning of a message, includes a pointer 204 to a source buffer, a destination address 205 indicating the destination slot (and bus if plural busses are used) of the message, and a data length field 206. The next entry 202 includes a flag indicating that it is a middle fragment, a pointer 207 to a source buffer, and a data length field. The third entry in the list 203 includes a control parameter indicating that it is a middle fragment, a pointer 208 to a source buffer, and a data length field. The final entry 204 includes a header indicating that it is the end of the message, a pointer 209 to the source buffer and a length field.
The transmit buffers pointed to by the pointers 204, 207, 208, and 209 contain the data of the message. They are concatenated according to the protocol and data length information in the first buffer pointed to by the pointer 204. The message packing buffers are used to generate a first bus cell generally 210 which includes a destination slot address, an indicator that it is the first cell in a message, and a count. The first cell in this example includes the contents of the buffer from pointer 204, the buffer from pointer 207, and a portion of the buffer at pointer 208.

The balance of the buffer at pointer 208 and the first portion of the buffer at pointer 209 are combined into the second cell 211. The balance of the buffer at pointer 209 is placed into the last cell 212.
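
The packing of fragments into cells described above can be sketched as follows. This is a simplified model, not the controller's logic: the function name `pack_cells`, the dictionary fields, and the 64-byte cell size are assumptions chosen for the example.

```python
def pack_cells(fragments, cell_size=64):
    """Concatenate fragments and cut the byte stream into fixed-size cells.

    Fragments need no particular alignment or size; cells are filled
    without gaps, and only the final cell may be partially filled.
    """
    stream = b"".join(fragments)
    cells = []
    for off in range(0, len(stream), cell_size):
        cells.append({
            "first": off == 0,                        # first cell of the message
            "last": off + cell_size >= len(stream),   # last cell of the message
            "seq": off // cell_size,                  # cell sequence number
            "data": stream[off:off + cell_size],
        })
    return cells
```

With four fragments of 20, 30, 40, and 70 bytes, the first cell holds the first two fragments plus the start of the third, the second cell holds the rest of the third plus part of the fourth, and the last cell holds the remainder, which mirrors the three-cell example of FIG. 10.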

The outbound path in the receiving processor loads the incoming data into the receive buffers 213 and creates an entry in the normal priority receive queue for the receive buffer.

The receive data structure is illustrated in FIG. 11. Basically, an incoming data stream is allocated to receive buffers using the free list 220, the channel status SRAM 221, the free list FIFO 222, and the high and low priority receive queues 223 and 224.

The hardware keeps state information for 32 receive channels. Each channel allows one message to be assembled into a cohesive message in memory. The channel keeps pointers to the next place to store the cell as well as a count and status information associated with the message. In one embodiment, receive channels are allocated to particular slots. Thus, slot zero on the bus will be given channel zero, for every processor on the bus; slot one will be given channel one; and so on.

The free list 220 is managed with a free head pointer 225 and a free tail pointer 226. Basically, buffers between the hardware owned free head pointer 225 and the software owned free tail pointer 226 are available for the hardware. Buffers pointed to by pointers above the free head pointer are either invalid because they contain data from previously received messages yet to be processed, are in use by a particular channel, or have been taken over by the hardware and loaded into the free list FIFO 222. In the example illustrated in FIG. 11, the invalid pointer N and invalid pointer 0 represent pointers to buffers which have been processed, and would be available for hardware when the free tail pointer is moved by the software.

FIG. 12 provides a receive queue example. The receive queue 230 is managed using a receive queue head pointer 231 and a receive queue tail pointer 232. Each entry in the receive queue includes flags, count, and a buffer pointer for a specific buffer. Thus, those entries between the head 231 and the tail 232 contain pointers to buffers in use. For example, an entry 233 includes a flag indicating that it is both the first and the last cell in a particular message, a length value, and a channel identifier. Entry 233 also includes a buffer pointer to the end of buffer 234. In an alternative embodiment, the buffer pointer points to the beginning of the buffer. As can be seen, this is a pointer to a buffer in channel three of length 80.

The next entry 235 is the first buffer in a 256 byte transfer in channel three with a pointer to buffer 236. The next buffer in this message is characterized by entry 237. It includes a pointer to buffer 238 and a parameter indicating that it is the middle transfer in the message. The last cell in this message is characterized by entry 239, which includes a pointer to buffer 240. The other examples shown in FIG. 12 include transfers that are characterized through a second channel, channel two, as described in the figure.
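
One way software might walk such a receive queue and group buffers back into messages is sketched below. The flag encoding (`FIRST`, `LAST`), the tuple layout, and the function name `collect_messages` are hypothetical and serve only to illustrate per-channel reassembly from first, middle, and last entries.

```python
FIRST, LAST = 0x1, 0x2  # hypothetical flag bits; a middle entry has neither set


def collect_messages(entries):
    """Group receive queue entries into complete messages, per channel.

    entries: iterable of (flags, channel, length, buffer_id) in queue order.
    Returns a list of (channel, [(buffer_id, length), ...]) in completion order.
    """
    open_msgs = {}  # channel -> buffers collected so far
    done = []
    for flags, channel, length, buf in entries:
        if flags & FIRST:
            open_msgs[channel] = []
        if channel not in open_msgs:
            continue  # middle or last entry with no preceding first entry
        open_msgs[channel].append((buf, length))
        if flags & LAST:
            done.append((channel, open_msgs.pop(channel)))
    return done
```

Because each channel assembles one message at a time, entries from different channels can be interleaved in the queue and still be separated by channel identifier.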
Hardware calculates the difference between the free head and the free tail pointers and uses that value to decide when to drop buffers in the receive queue to free up space to accept additional messages. This mechanism provides higher reliability to the high reliability queue, and a lower reliability to the high throughput transfer queue, which are found in the receive list. Hardware will provide a number of watermarks that can be used to determine whether to accept new high throughput queue messages, or whether to drop them. The high throughput messages will be dropped so that the free list will not become depleted and high reliability queue messages can always be received. The source of the high reliability queue messages either has to have exclusive permission to send X number of frames, or get new permission occasionally through a handshake protocol, or the sender can rate limit requests to some number/second that the receiver must be configured to handle.

This mechanism will also be used to provide several levels of priority to provide some level of fairness among the high throughput traffic. The concept is to mark a small number of packets per second as high priority, and the rest as normal priority. The receive hardware will start dropping normal priority messages first, so that each slot can get data through, even in the case of another sender trying to hog the bandwidth.

FIG. 13 illustrates the command list and receive list processes according to one embodiment of the present invention. As can be seen in the figure, the transmit side includes a high priority command list 250 and a normal priority command list 251. In the message passing process, a command transmit function 252 is included which is coupled with both the high priority command list 250 and the normal priority command list 251. This transmit function 252 transmits commands across the backplane bus 253, or other communication media such as a LAN, to a receive filtering process 254 at the receiving end of the message transfer. Receive filtering process 254 also includes dispatch logic which dispatches the messages to either a high priority receive list 255 or a normal priority receive list 256.

In operation, these functions are managed by software according to latency, throughput, and reliability of the messages being transmitted. For example, software may write commands for messages that require low latency into the high priority command list 250, while writing the majority of commands which require high throughput into the normal priority command list. According to this approach, the command transmit function 252 can select commands for transmission according to a simple priority rule: any high priority message goes ahead of any normal priority message. More complex priority schemes, including fairness concepts and avoiding lockouts, could be utilized as suits a certain implementation. Messages transmitted across the backplane 253 are accepted by the receive filtering function 254. The filtering function drops the cells according to the available buffers as measured against watermarks based on reliability tags in the message header, and routes the received messages to either the high priority receive list 255 or the normal priority receive list 256, based on a control bit in the message header. For example, in a system with two receive buffer watermarks, there will be three levels of reliability (or cell loss priority) established. All those cells in a first class will be dropped if the number of available receive buffers falls below a first watermark. Messages in a second class will be dropped when the number of available buffers falls below a second watermark. Messages in the final class are dropped only if there are no receive buffers left to receive the message. Both the watermark class, which establishes the reliability of transfer, and the destination receive queue to which the message is dispatched, are specified in the message address, as control bits in a preferred embodiment. Thus, from the hardware point of view, the receive lists 255 and 256 are identical in behavior. Software manages the processing of messages listed in the high priority receive list and the normal priority receive list as desired in a particular implementation. For example, the software may process all high priority receive list messages first, so that so called low latency messages can achieve lowest latency available. High throughput messages will be routed into the normal priority receive list, and managed as quickly as possible by the receiving processor.

FIGS. 14 and 15 illustrate the data paths and address paths for message passing controller hardware. The message transmit data path is illustrated in FIG. 14. The possible sources of the data include a processor write data on line 260, data from the local synchronous DRAM (SDRAM) on line 261, and data from the bus on line 262. The path on line 260 which provides the processor write path is not used in one embodiment of the invention. Data is directed to the bus on line 263, to the local synchronous DRAM on line 264, or to the local processor directly during a processor read operation on line 265. The processor write data is supplied through a bus write buffer 266 to an output multiplexer 267. Data from the SDRAM on line 261 is supplied through multiplexer 268 across line 269 to a packing cell buffer 270. The output of the packing cell buffer 270 is supplied on line 271 to the output multiplexer 267. It is also supplied in feedback to the inbound multiplexer 272.

Data from the bus on line 262 is supplied to a receive cell buffer 273, the output of which is supplied as a second input to the multiplexer 272. Also, data from the bus is supplied as a second input to the multiplexer 268 which supplies input to the packing cell buffer 270. Further, data from the bus is supplied on line 265 directly to the processor read path.

As can be seen in the figure, the message transmit data path is sourced from the SDRAM on line 261, and selected through multiplexer 268 into the packing cell buffer 270. From the packing cell buffer 270, it is supplied through multiplexer 267 out onto the bus.

FIG. 15 illustrates the address path structures, and the message transmit address path. As can be seen, the addresses are generated in response to the command lists 300, and from the bus address in line 301. Addresses from the command list drive a source address generation block 302, and a destination address generation block 303. The output of the source address generation block is supplied through multiplexer 304 to the address out multiplexer 305. The output of the destination address generation block 303 is supplied through the message address generator 306 to the bus address output multiplexer 305, and to the multiplexer 307 in the inbound path. Also, the destination address generation output is supplied as a second input to multiplexer 304 in the output path, and as an input to multiplexer 308 in the input path. The source address generation block also sources the synchronous DRAM read address line 309.

Other inputs to the multiplexer 305 include a processor read address directly from the local processor on line 310, and a tag address on line 311.

The bus address register 312 is driven by the address in on line 301. The output of the register 312 is supplied through multiplexer 307 to the message address register 313. This address register identifies the channel for the message which is used to access the channel status RAM 314. The channel status RAM supplies a receive buffer address as an input to multiplexer 308. The mechanism also includes a promiscuous receive address generator 315 which supplies a third input to the multiplexer 308. The output of the multiplexer 308 feeds the synchronous DRAM write address counter 316, which drives the synchronous DRAM write address on line 317.

As can be seen, the message transmit address path originates with the command list 300. The command list drives the source address generation block 302 to supply a synchronous DRAM read address on line 309. Also the command list drives the destination address generation block 303 to supply a message address generator 306. This basically supplies the slot number and channel number for the message to be supplied on the output bus.

Hardware initiates message transmit from a command list maintained in SDRAM. The message may consist of multiple fragments stored in SDRAM memory which are then packed into double-buffered outbound cells. The bus transfer address is really a message control field containing such things as a field identifying the cell as part of a message, the destination slot and logic channel, first and last cell control bits, and the cell sequence number within the message.

To transmit a message fragment:

read command list entry, decode as outbound msg fragment (for addr generation).
recognize first, middle, last fragment of a message (for outbound buffer control purposes).
request SDRAM read access (and check packing cell buffer availability).
wait for granting of SDRAM resource.
if buffer available, begin transferring data bytes/words from SDRAM to cell buffer.
continue to move data to cell buffers (with data flow control).
maintain cell buffer byte count and buffer status to implement flow control.
pack and align data within cells.
generate message address for bus (including first, last, sequence information).
generate bus transfer byte count field (depends on size of buffer flush).
queue cell for flush (i.e., bus transmit).
arbitrate for bus interface resource (other functions may request bus transfer).
wait until bus interface granted.
arbitrate for ownership of bus.
move data words from cell buffer to bus interface (with flow control).
generate or check outbound data parity.
complete burst write on bus.
log cell transmit status (success/fail).
free cell buffer for more outbound data.
move more data from SDRAM into cell buffer.
continue this process until fragment move is complete.
update command list pointer (indicates transfer complete).

To transfer a complete message:

process multiple fragments from command list as detailed above (a message may be a single fragment).
pack fragments into continuous cells without gaps.
flush partial cell buffer when message ends.
notification of message sent.

FIG. 16 shows the structure of the Message Receive Logic Block 410 of FIG. 35. Any data transfer bound for SDRAM moves through this logic. Message and non-message transfers are treated differently: cells which are part of a message transfer are moved into the SDRAM receive buffer structure, while non-message cells do not move into receive buffers; they are written to a specific physical SDRAM address.

Quite a bit of the logic in this section is associated with management of the receive buffers and bus logical receive channels.

The major functional blocks are summarized as follows:

get_free_bufs 500
Maintain status of the double-buffered free list buffer FLB. Generate read requests and manage movement of data into the FLB from IBUS. Contains free_head_reg, free_tail_reg, free_start_reg, and free_size_reg registers.

buffer allocation 501
Allocate buffers from the FLB to logical receive channels. [illegible] channel status buffer CSB [illegible] Maintain the Channel Status validity register. This module is necessarily quite intimate with an icb_flush module 504 which needs to mark CSB entries invalid as they are flushed to receive buffers and which needs to check for a valid channel status entry before flushing an ICB message cell to SDRAM.

rcv_buff_flush 502
Manages the queuing and flushing of completed receive buffers onto the two receive lists maintained in SDRAM. Buffers and status are moved into the rcv and hrcv list buffers (RLB and HLB) by the flush_to_ibus function. Then, the rcv_buff_flush function manages posting requests to the ibus and the associated flushing of the RLB and HLB.

msg_rcv_and_icb_fill 503
Moves data from bus into the ICBs. Writes the ICB tags. Performs receive filtering (perhaps).

flush_to_ibus 504
Reads ICB tags and performs ICB flush to IBUS. Updates CSB cell count field and determines when an entry moves from the CSB to the RLB or HLB. Writes RLB entries based on CSB and ICB tags. Checks cell sequence and maintains channel status; may drop ICBs and report error conditions.

gen_icb_flush_addr (within flush_to_ibus 504)
This function takes the bus channel status RAM contents and conditions them to create an ibus address for flushing one of the ICBs. At the same time, the cell count associated with the logical bus receive channel is incremented for write back into the channel status RAM or into the rcv_list buffer RAM. Some registering logic may be required in this path, since the CSB is being modified as the flush occurs.

The get_free_bufs block 500 generates addresses and requests for management of the free list buffer 505. Thus, outputs of the block 500 include the free list buffer read address on line 506, the free list buffer fill request on line 507, and the free list buffer addresses on line 508. In response to requests from the get_free_bufs block 500, the free list buffer data is supplied from the intermediate bus on line 509 to the free list buffer 505. Data from the free list buffer is supplied on line 510 to the channel status buffer 511. This process is managed by the buffer allocation block 501, which maintains the watermark registers and the channel status validity registers. The channel status buffer outputs are supplied on line 512 to the flush_to_ibus block 504. Also, addresses from the flush_to_ibus block 504 are supplied on line 513 to the channel status buffer for accesses to it.

The rcv_buff_flush block 502 manages the high priority receive buffer 514 and the normal priority receive buffer 515. This block manages the receive buffer validity, and the tail registers for the receive buffers. Outputs of this block include receive buffer addresses on line 516, the receive list addresses on line 517, receive list length value on line 518, and a receive list flush request on line 519.
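
The watermark-based receive filtering described for FIG. 13 can be sketched as follows, assuming two watermarks and therefore three reliability classes. The function names, the class numbering, and the watermark values (8 and 2 buffers) are assumptions for illustration; the specification does not give concrete values.

```python
def available(free_head, free_tail, list_size):
    """Free buffers between the hardware-owned head and the software-owned tail."""
    return (free_tail - free_head) % list_size


def accept_cell(free_buffers, reliability_class, watermarks=(8, 2)):
    """Decide whether to accept a cell of the given reliability class.

    Class 0 is dropped first (below the first watermark), class 1 is
    dropped below the second watermark, and the final class is dropped
    only when no receive buffers are left.
    """
    if reliability_class < len(watermarks):
        return free_buffers >= watermarks[reliability_class]
    return free_buffers > 0
```

With four free buffers, for instance, a class 0 cell is dropped while class 1 and class 2 cells are still accepted, so the free list is reserved for the more reliable traffic as it runs low.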
The incoming data path from the system bus is driven across line 520 to the incoming cell buffers generally 521. Write addresses for the incoming cell buffers are supplied on line 522 from the msg_rcv_and_icb_fill block 503. Block 503 receives the addresses from the system bus on line 523 and generates the incoming cell buffer addresses on line 522. Also, the block 503 manages the incoming cell buffer tags [illegible] bus under control of the flush_to_ibus block 504. This block receives the channel status on line 512 and buffer size information on line 525. It generates the read addresses for the incoming cell buffers on line 526 and causes a flush of data on line 527 to the local SDRAM. This block also generates the incoming cell buffer flush request on line 528, the flush address on line 529, and the flush length value on line using two control signals 530 for management of the flush to the local memory.

FIG. 17 shows the structure of the Command List Data Transfer Logic. The MPC transfers data according to commands placed by software onto one of two command lists (NCLB 600, HCLB 601) or onto a high priority one-shot command buffer (OSCB 602). All data transferred under command list flows through the packing cell buffers 608 (PCBs), and both source and destination (fill and flush) may be either system bus or internal bus.

The major functions of this block are summarized as follows:

parse_cmd 604
Read entries from the CLBs 600, 601 and OSCB 602. Determine which command to next process. Associate multiple CLB entries and handle as a single message (cause packing to occur). Move address entries to fill_pcb module 606. Write CLBs invalid once entries are processed. Flush entries for a message that hits an error condition.

fill_clb 605
Generate ibus request to get next block of CLB entries. Mark CLBs valid as they are successfully filled.

fill_pcb 606
Generate request to either ibus or bus to read data (through byte_packer) into PCBs. Flow-control filling of PCBs. Write PCB tags.

flush_pcb 607
Read PCB tags. Generate request to either ibus or system bus to flush PCBs. Write PCB tags empty once transfer completes.

Data from the system bus comes into the packing cell buffers across line 609 into a byte packing mechanism 610. From the byte packing mechanism, the data is supplied on line 611 into the packing cell buffers 608. Also, data may be supplied to the packing cell buffers from the internal bus across line 612 coupled to the byte packer 610 [illegible] supplied from the packing cell buffers on line 613 either to the system bus or to the internal bus, as required.

The command buffers are filled from the internal bus across line 614 at addresses supplied on line 615 from the fill_clb module 605. The fill_clb module 605 also generates the fill requests on line 616 and the read addresses on line 617 to support fills and reads of the command list buffers. Also, the fill_clb module 605 manages the clb head register, the clb tail register, the clb start register, and the clb size registers.

Command list data comes out of the command list buffers across line 618 into the parse_cmd module 604. This module supplies the source and destination addresses and necessary flags on line 619 to the fill_pcb module 606. The fill_pcb module 606 generates the internal bus fill address on line 620 and the internal bus fill request on line 621. Also, it generates system bus fill addresses on line 622 and fill requests on line 623. Further, it loads the packing cell buffer tags 624 with appropriate data across line 625. These tags are read by the flush_pcb module 607 which manages flushing of the packing cell buffers. This module 607 supplies the read addresses for the packing cell buffers on line 627, and issues the internal bus flush requests and flush addresses on lines 628 and 629, respectively. Also, system bus flush requests and flush addresses are supplied on lines 630 and 631 from the flush_pcb module 607.

The message passing controller acts as a channel device on the internal bus operating according to the protocol of that bus. The types of message passing controller initiated transfers on the internal bus are detailed in Table 1.

TABLE 1

MPC-initiated IBUS transfer types

Column headings: type of transfer; expected length; priority. Most rows of this table are illegible in the source. Legible fragments from the earlier rows: "(16 entries)"; "needs to flush to make room"; "(there are two, normal ..."; "max 16-word"; "large buffers will mean most transfers burst full 16 words".

Legible row:

type of transfer: event
expected length: single-word write to SMC
priority: latency of interrupt/event to processor is a factor, but there is no danger of loss or overrun here (REQ)

Event/interrupt signals from the MPC are supplied to the
local processor through the shared memory controller SMC,
according to the internal bus protocol.

The conditions set out in Table 2 send an "event" from the
MPC to the SMC (causing a bit-set operation in the SMC's
out-of-band event register):

TABLE 2

MPC out-of-band event codes

event code   event                                          synchronous with data xfer
0            no event                                       n/a
1            command entry complete (when notify bit set)   if data write to SDRAM involved
2            receive list write (new receive tail)          yes
3            uart trap assertion (system bus cycle)         no
4            out of receive buffers                         no
5            system bus error condition                     no
6            dropped hi-cell-priority system bus cell       no
7            spare

The MPC makes efficient use of the SDRAM memory
system. This means that the MPC will read up to 16 words
across the internal bus and the transfers may cross 16-word-
aligned boundaries. Any number of words up to 16 may be
read from any word-aligned address. The SMC is expected
to deal with SDRAM page crossings, so the MPC need not
account for or track that case. If the SMC needs to put ibus
wait-states out to deal with [illegible], the MPC will
implement ibus wait-states (as instructed via a wait command
code from the SMC chip). In the case when a long
transmit buffer is to be read, the MPC will shorten the first
ibus burst read so that subsequent bursts will
be 16-word aligned bursts.

The following categories of events and interrupts are
implemented in the MPC:

[category name illegible]
This is a non-maskable trap signal to the local processor.
A bus write to the [illegible] register causes the
MPC to assert a warn signal directly to the local processor.
The MPC simply sources a signal which, at the board level,
is connected to the processor. This signal bypasses the IBUS
and SMC.

uart_trap
Used for out-of-band debug, a bus write to the uart_trap
register causes the MPC to send an event to the SMC (via the
out-of-band event mechanism on the ibus), which in turn
asserts a trap signal to the local processor.

channel_device_interrupts
This class of events uses the ibus out-of-band event
mechanism including the event_tag field in the upper nibble
of the ibus address. This is used by the MPC to notify the
local processor of command completion, dropped cells, bus
error, illegal command entry, etc. In the SMC, each event
may be set as a polled event or summed into an intr signal
to the processor.

in-band interrupts and events
This amounts to a register access from a channel device
to the SMC (the SMC sums events and interrupts into
registers providing up to 32 bits per channel device). The
MPC does not use this mechanism, but would do so if bus
interrupt and event receives were implemented.

Registers in the MPC are listed below with a detailed
description of their function. The name will be given first,
then in parentheses the address offset is stated in hexadecimal.
The size of each register will be given along with a
description of the register's function. Unless stated
otherwise, assume that the register is R/W. Unless stated
otherwise, assume that all registers are set to zero when the
MPC comes out of reset.

1. System Registers

Slot Number (0000)
This is a 4-bit register providing the encoded slot number,
from 0 to 16.

Arbitration and Priority ID (0004)
This 4-bit register provides a device with an encoded
arbitration ID. The priority bit used by the device is
determined by adding 16 to the arbitration ID. This use of priority
is enabled by device specific means.

Arbitration Mask (0008)
This 16-bit register is used to mask (AND) arbitration/
priority levels on the bus. Thus, 0's are set in every bit
corresponding to non-existent cards, and 1's are set in every
bit corresponding to existing cards. Thus, all devices must
drive both arbitration and priority lines during every
arbitration phase.

Revision Register (000C)
This 4-bit read-only register gives a revision number for
the Core bus device.

Core Bus Device Type (0010)
This 8-bit register gives a hard coded bus device type.
Different core bus devices will have different register
configurations, so software must check the value in this
register before attempting to program the device. The CMC
[value illegible] and the MPC will be set at 2.

[register name and offset illegible]
This 3-bit register indicates how long to wait when a
[remainder illegible].

Parity Byte (001C)
This 5-bit register has one or more of its four bits set to
indicate which bytes of the data at the affected address
caused a parity error. The appropriate bits in this register are
written by a core bus device receiving core bus data with bad
parity. These flags are read only. The lowest 4 bits indicate
a data parity error, while the highest bit indicates an address
parity error. The lowest bit is associated with the data byte
on D0-D7, and the fourth lowest with the data on D31-D24.

Address Generating Parity Error (0020)
This 32-bit register holds the address which had parity
error problems.

Backoff Counter (002C)
This 4-bit read/write register gives a count of the number
of backoffs received by this chip. An error is generated by
the chip when 16 backoffs in a row are received.
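The burst-shortening rule for long transmit buffer reads (the first ibus burst is shortened so that later bursts are 16-word aligned) can be illustrated with a minimal sketch. The function name `split_bursts` and the use of plain word addresses are assumptions made for illustration; they are not taken from the MPC design.

```python
BURST_WORDS = 16  # maximum ibus burst length stated in the text


def split_bursts(start_word: int, count: int) -> list[tuple[int, int]]:
    """Split a read of `count` words into (start_word, length) bursts.

    Hypothetical helper: the first burst is shortened so that every
    later burst starts on a 16-word-aligned word address.
    """
    bursts = []
    addr, remaining = start_word, count
    while remaining > 0:
        # Words left before the next 16-word boundary (16 if already aligned).
        room = BURST_WORDS - (addr % BURST_WORDS)
        length = min(room, remaining)
        bursts.append((addr, length))
        addr += length
        remaining -= length
    return bursts
```

For a 40-word read starting at word 5, this sketch produces a short first burst of 11 words, one full 16-word burst, and a final partial burst.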
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Corebus Device Configuration (0030) 
This 5-bit register holds the reset and enable bits shown 
in Table 3: 

TABLE 3

Corebus Device Configuration

Bits Description

4 BRES - This bit is used to reset the IOP board (see the "Reset"
chapter for additional detail).

3 SCRES - When this bit is set it initiates a cold reset. A cold reset
reinitializes all values to be identical to power-up except that the
error state information is saved. This bit can also be set as a side
effect of the Corebus ERR bit being set for more than 24 clock periods.

2 SWRES - When this bit is set it initiates a warm reset. A warm
reset stops operation of the device and returns it to a known free
and idle state, disabling operation, but does not reinitialize the
values of registers. The SWRES bit can be set by the ERR signal
being asserted more than 12 clock periods.

1 ARBE - This enables the device to drive its arbitration bit on the
Corebus. Note that driving its arbitration bit is not the same as
asserting its arbitration bit.

0 CBE - This enables the device to transmit over the Corebus. When
disabled the device may still participate in arbitration.
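As a rough illustration of how software might name the five bits of Table 3, the following sketch defines them as flags. The bit positions come from the table; the Python identifiers (`CorebusConfig`, `TX_ENABLE`, and so on) are hypothetical and chosen only for readability.

```python
from enum import IntFlag


class CorebusConfig(IntFlag):
    """Bit positions from Table 3; the Python names are illustrative."""
    TX_ENABLE = 1 << 0    # bit 0: enables transmission over the Corebus
    ARB_ENABLE = 1 << 1   # bit 1 (ARBE): drive the arbitration bit
    WARM_RESET = 1 << 2   # bit 2 (SWRES)
    COLD_RESET = 1 << 3   # bit 3 (SCRES)
    BOARD_RESET = 1 << 4  # bit 4 (BRES)


def set_flags(value: int) -> list[str]:
    """Return the names of the flags set in a 5-bit register value."""
    return [flag.name for flag in CorebusConfig if value & flag]
```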



Core Bus Error Status (0128)

This 10-bit register provides error bits to guide the
software when it receives an error interrupt as shown in
Table 4. Any bit set causes the error interrupt to be
requested.

TABLE 4

Error Status Register

Bits Description

0 This bit indicates that a Core bus time out occurred.

1 This bit indicates that a backoff retry sequence was not successful.

7:4 These bits indicate a parity error occurred on data sourced from the
Core bus. If these bits are set it may be in tandem with bit 9
(processor read) or Core bus agent write.

8 This bit indicates that an address parity error occurred.

9 This bit indicates whether the last cycle that had an error was a
write from another device or a read by this device.
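A minimal sketch of how an error-interrupt handler might split the 10-bit error status value into the fields of Table 4. The function and key names are assumptions; the text does not say which value of bit 9 means "write from another device", so that bit is returned uninterpreted.

```python
def decode_error_status(value: int) -> dict:
    """Split a Core Bus Error Status value into its Table 4 fields."""
    return {
        "timeout": bool(value & 0x001),               # bit 0
        "backoff_retry_failed": bool(value & 0x002),  # bit 1
        "data_parity_bits": (value >> 4) & 0xF,       # bits 7:4
        "address_parity": bool(value & 0x100),        # bit 8
        "cycle_type_bit": (value >> 9) & 1,           # bit 9, polarity not stated
    }
```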



2. List Registers

There is a group of registers which can be described as
list registers. There are registers for the free list, normal
priority command list, high priority command list, normal
priority receive list, and high priority receive list. Each will have
start, size, head, and tail registers. The start and size registers
will be set during initialization by software. Initially both the
head and tail registers will be set to 0. The MPC will be
continually updating the head register. The software will
occasionally read the head register and set the tail register
(not necessarily at the same time). From the perspective of
the MPC the head pointer will always be current while the
tail pointer may be stale (being stale does not mean that it
cannot be used; it means that the current tail pointer may be
old).

2.a. Free List Registers

The free list registers have a series of pointers associated
with them. The start pointer points to the beginning of the free
list. The start+size will point to the location just below the
bottom of the free list. The head pointer indicates the
location in memory where the hardware removes the entries
from the list. This pointer is set by the hardware. The
software will have to query the hardware to get this
information. The tail pointer points to the next location at which
software will allocate new free list pointers. FIG. 18 shows
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the free list structure and its associated registers in the MPC 
and the free buffers in the SDRAM. 
Free Head Register (0200) 

This is an 11-bit register which holds part of the address
of a pointer in the free buffer list (in SDRAM) which points
to the next free receive buffer (in SDRAM) to be loaded into
the free list buffer (in MPC). The bottom 6 bits of the 32-bit
address are not included because the free list entries are
transferred in 16-word aligned blocks. The top 15 MSBs of
the address are not included because they will never change and
are specified by the free start register. The value of the free tail
register must be 1 or more higher than the value of the free
head register for the MPC to use the entries specified by the
free head register. If they are equal it means that there are no
valid entries available. If the free tail register is smaller than
the free head register, it means that the free tail register must
have already wrapped around the bottom of the free buffer
list and started from the top again. This means that it is
all right to transfer the pointers to the free buffers into the
MPC's free buffer list. Reads to this register will behave
differently than writes because during writes the entire 32
bits of the address will be valid. This address is generated by
concatenating the bits [31:17] from the free start register, the
merge of bits [16:10] of the start with bits [16:10] of the free
head, the bits [9:6] of the free head register, and bits [5:0]
padded with 0's.
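The address composition and the head/tail comparison described for the free head register can be sketched as follows. This is only an illustration: the function names are hypothetical, and the "merge" of bits [16:10] is assumed here to be a bitwise OR, which the text does not state.

```python
def free_entry_address(free_start: int, head_reg: int) -> int:
    """Rebuild the 32-bit SDRAM address named by the 11-bit free head value.

    Assumption: the merge of start bits [16:10] with head bits [16:10]
    is a bitwise OR.
    """
    upper = free_start & 0xFFFFFC00   # bits [31:10] of the free start address
    lower = (head_reg & 0x7FF) << 6   # 11-bit register maps to bits [16:6]
    return upper | lower              # bits [5:0] remain zero


def free_entries_available(head_reg: int, tail_reg: int) -> bool:
    """Equal head and tail mean no valid entries; a smaller tail has wrapped."""
    return head_reg != tail_reg
```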
Free Tail Register (0204) 

This is an 11-bit register which holds a portion of the
address of a pointer in the free buffer list (in SDRAM) which
will point to the next free buffer as determined by software.
Like the free head register, the bottom 6 bits of the 32-bit
address are not needed since software will be assigning 16
buffers at a time and the top MSBs of the address will not
be needed since they will always be the same as the free start
register. Once again, the reads to this register will behave
differently than writes (see the free head register definition
for additional information).

Free Start and Size Register (0208) 

This is a 30-bit register which holds the 22 MSBs of the
address of the top of the free buffer list (in SDRAM) and 8
bits of size information. A size value of 00000001 will
correspond to the minimum normal priority command list
size of 256 entries, 00000010 corresponds to 512 . . .
10000000 corresponds to the maximum normal priority
command list size of 32768.
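The size encoding above can be read as a one-hot code scaled by 256. The sketch below decodes it under that reading; the function name is hypothetical, and rejecting values with more than one bit set is an assumption, since the text only lists single-bit examples.

```python
def list_size_from_code(code: int) -> int:
    """Decode the 8-bit size field: 00000001 -> 256 ... 10000000 -> 32768.

    Assumption: only single-bit (one-hot) codes are valid.
    """
    if not 1 <= code <= 0x80 or code & (code - 1):
        raise ValueError("size code must have exactly one of 8 bits set")
    return code * 256
```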

Free Watermark 0 Register (020C)

This 11-bit register stores the count (x16) of valid entries
in the free list below which the hardware will have different
characteristics knowing that the number of entries in the free
list is getting low. The MPC will start dropping medium and
low reliability cells when the free buffers are less than the
number indicated by this register.

Free Watermark 1 Register (0210) 

This 11-bit register is similar to the free watermark 0
register; just replace "0" with "1". The MPC will start
dropping low reliability cells when the free buffers are less
than the number indicated by this register.
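The two watermark thresholds amount to a small drop policy, sketched below. The function name, the string reliability labels, and the interpretation of "(x16)" as "register value times 16 entries" are assumptions for illustration.

```python
def should_drop(free_buffers: int, reliability: str,
                watermark0_reg: int, watermark1_reg: int) -> bool:
    """Decide whether a cell is dropped given the free buffer count.

    Assumption: each watermark register holds a count in units of 16.
    """
    wm0 = watermark0_reg * 16  # below this: drop medium and low reliability
    wm1 = watermark1_reg * 16  # below this: drop low reliability
    if reliability == "low":
        return free_buffers < wm1 or free_buffers < wm0
    if reliability == "medium":
        return free_buffers < wm0
    return False  # high reliability cells are never dropped by this rule
```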

2.b. Command List Registers 

The command list registers are very similar to the free list
registers. Both need to get information off a list while
keeping track of where to get the next element of the list and
the location of the end of the list. For the command list
registers a watermark register will not be necessary. (Thus
generating the difference between the head and tail register
will not be necessary, just an equality check to see if we are
out of commands.) The MPC will assume that the software
will update the command lists 4 commands (16 words) at a






time. If the software cannot fill the 4 commands, it will put
the null command in the next empty command field.
Normal Priority Command Head Register (0214)
This 11-bit register is identical to the free head register;
just replace "free" with "normal priority command."
Normal Priority Command Tail Register (0218)
This 11-bit register is identical to the free tail register; just
replace "free" with "normal priority command."
Normal Priority Command Start and Size Register (021C)
This 30-bit register is identical to the free start and size
register; just replace "free" with "normal priority
command."

High Priority Command Head Register (0220)
This 11-bit register is identical to the free head
register; just replace "free" with "high priority command."
High Priority Command Tail Register (0224)
This 11-bit register is identical to the free tail
register; just replace "free" with "high priority command."
High Priority Command Start and Size Register (0228)
This 30-bit register is identical to the free start and size
register; just replace "free" with "high priority command."



Normal and High Priority Command Head Register (022C)

This 22-bit register holds the contents of both the normal
priority command head register and high priority command
head register. This is to allow transfers of the command head
registers in one 1-word transfer. This register is a "phantom"
register which points to the two "real" registers which
actually hold the information.

Normal and High Priority Command Tail Register (0230)

This 22-bit register holds the contents of both the normal
priority command tail register and high priority command
tail register. This is to allow transfers of the command tail
registers in one 1-word transfer. This register is a "phantom"
register which points to the two "real" registers which
actually hold the information.

2.c. Receive List Registers

The receive list registers are similar to the command list
registers. Hardware writes the receive list entries to the
location pointed to by the receive tail register. The receive
list's head register is not needed because software
will never give hardware enough receive list entries for the
tail to overrun the head. The receive list tail register must
have a higher resolution than the other list tail registers since
there will no longer be a requirement of 16-word transfers.
Normal Priority Receive Tail Register (0234)
This is a 14-bit register which holds a portion of the
address of a pointer in the normal priority receive list. The
top 15 bits of the 32-bit address are not needed since they
will be the same as the normal priority start register. The
bottom 3 bits are not needed since they will always be 0
since the descriptors to the receive buffers will always be
sent in 2-word increments. This register will wrap around
back to 0 when it has exceeded the size of the list.
Normal Priority Receive Start and Size Register (0238)
This is a 30-bit register which holds the 22 MSBs of the
address of the beginning of the normal priority receive list
space (in SDRAM) and 8 bits of size information. A size
value of 00000001 will correspond to the minimum normal
priority command list size of 256 words, 00000010
corresponds to 512 . . . 10000000 corresponds to the maximum
normal priority command list size of 32768 words.
High Priority Receive Tail Register (023C)
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High Priority Receive Start and Size Register (0240) 

This 30-bit register is identical to the normal priority
receive start and size register; just replace "normal priority"
with "high priority."

Receive Buffer Size Register (0244)

This 8-bit register (N) holds the information about the size
of the receive buffers in the SDRAM. The size of the buffer
will be N*64 bytes except when N=0. When N=0 the size of
the buffer is 16384 bytes. Table 5 provides the encoded
values stored in the register and the corresponding size
represented by that encoded value.

TABLE 5

Receive Buffer Size Register Decode

Encoded Value    Size of Buffer in Bytes
00000001         64
00000010         128
00000011         192
00000100         256
. . .            . . .
11111110         16256
11111111         16320
00000000         16384
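The decode rule behind Table 5 (N*64 bytes, with N=0 standing for the largest size) is small enough to state as a sketch. The function name is hypothetical.

```python
def receive_buffer_bytes(n: int) -> int:
    """Decode the 8-bit receive buffer size register value N into bytes."""
    if not 0 <= n <= 0xFF:
        raise ValueError("N must fit in 8 bits")
    # N = 0 encodes the maximum size, 256 * 64 bytes.
    return 16384 if n == 0 else n * 64
```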



3. Miscellaneous Registers

immediate Bus Error Status Register (0248) 

This 32-bit register holds the error status information. 

Miscellaneous Register (024C) 

This 7-bit register holds the txe, rxe, pxe, cmd_check_enable,
set_cb_reset_reg, cb_master_reg, and
loopback_thru_cb bits having the functions described in
Table 6 below.



TABLE 6

Miscellaneous Register

Bit   Name               Description
0     loopback_thru_cb   1 means that loopback occurs through the Core
                         bus interface; 0 means that the Core bus [...]
1     cb_master_reg      1 indicates that this MPC is the master of the
                         Core bus.
2     set_cb_reset_reg   If this bit is set, it will source a cb_reset.
3     cmd_check_enable   If this bit is set, error checking on the [...]
4     pxe                This bit indicates whether the last cycle that
                         had an error was a write from another device or
                         a read by this device.
5     rxe                This bit is the receive enable. If it is set then
                         the MPC is willing to accept data transfers.
6     txe                This bit is the transmit enable. If it is set then
                         the MPC is able to send data transfers.



UART Registers

The uart_register function provides a path for "out-of-
band" communication between cards across the corebus.
This feature requires software driver support (call it a remote
monitor function, or whatever). Another card may access
registers in the MPC's corebus init space. The local
processor also has access to this register set, facilitating board-level
communication.

High Priority Receive Tail Register (023C): This 14-bit register
is identical to the normal priority receive tail register; just
replace "normal priority" with "high priority."
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TABLE 7 
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Error Status Register 



bits Description 
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30 



0 This bit indicates that a Core bus time out occurred.

1 This bit indicates that a backoff retry sequence was not successful.

7:4 These bits indicate a parity error occurred on data sourced from the
Core bus. If these bits are set it may be in tandem with bit 9
(processor read) or Core bus agent write.

8 This bit indicates that an address parity error occurred.

9 This bit indicates whether the last cycle that had an error was a
write from another device or a read by this device.

4. Pointer Lists and Address Fields 

Pointer lists and memories reside within the MPC. Three
types of pointer lists exist: the command list, the free list,
and the receive list. These lists allow software to
communicate to hardware the whereabouts of various buffers within
SDRAM.

The SDRAM memories within the MPC, aside from the
cached pointer lists, provide a storage area for inbound and
outbound data as well as address buffer locations.

Each cell transferred over the bus has an address field. The
information within these fields relates to information
software supplies to the hardware via the command list.

The pointer lists and memory structures of the MPC as 
well as information contained in a cell address field are 
outlined below. 

4.a. The Pointer Lists 

The Command List 

The command list consists of an array of four-word 
entries stored in SDRAM which contain instructions from 
the software to the hardware. The instructions may ask 
hardware to gather, pack, and move data between SDRAM 
and COX shared memory, source an interrupt or event to the 
bus, or read/write a word of data to bus I/O or memory 
space. A portion of the command list will be cached within
the MPC. The cache spans two groups of 2x16x32 bits.

The possibility exists for three types of command list
entries. One type of command list entry points at data in a
message fragment buffer for incorporation into a message
transfer. A cell which is part of a message transfer is
prepended with a message address field. The second type of
command list entry points at data in a non-message fragment
buffer for incorporation into a non-message transfer. A
non-message transfer cell uses a non-message address field
as its prepended cell header. The third type of transfer is a
type of non-message transfer except in this case there is no
fragment buffer. One word of data is written to the bus
memory or I/O space. The word for writing is actually
specified within the command list entry. These transfers are
called embedded-data transfers. Embedded-data transfers,
being a type of non-message transfer, use non-message
address fields as their prepended cell header.

Table 8 below shows the first six bits in a command list
entry given a particular type of transfer. FIG. 9 gives a short
description of each type of transfer. Tables 9 and 10 state the
meaning of the Destination and Source Code bits in Table 8.
These bits indicate whether data is transferred to/from the
IBus/system bus and whether the transfer is in memory
space or in I/O space. It is intended that CBIO WRITE and
CBMEM WRITE (the embedded-data transfers) move only
one word at a time onto the bus. Therefore, no source
address is needed and the data to be written may be
embedded in the command list in place of the source address.
This is indicated with a source address code of 2'b00.

Special care must be taken when a command list entry
specifies the movement of data with a destination address in
local SDRAM. Software needs a reliable method for
determining that that type of transfer has actually completed (the
data is actually in local SDRAM). To do this, the MPC
hardware will automatically block command list processing
(not bump the head pointer) until data bound for SDRAM
via a non-message transfer has successfully flushed across
the ibus. Also, any event associated with this entry (specified
by a command list notify bit; see below) will not be sent until
the write to SDRAM is completed. This allows the software
event handler to read head pointers to determine which
entries are actually complete once an event is received (since
there could be several entries causing events quite close
together, head pointer management is critical).
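One way a software event handler could use the head pointer is to compare the value read now against the value it saw at the previous event, allowing for wrap-around. The sketch below is illustrative only: the function name is hypothetical, and it assumes the head register counts 16-word blocks (four commands each) and wraps at the list size.

```python
COMMANDS_PER_BLOCK = 4  # the text states 4 commands (16 words) per update


def commands_completed(prev_head: int, new_head: int, list_blocks: int) -> int:
    """Count commands the hardware has moved past since the last event.

    Assumption: head values are block indices in range(list_blocks).
    """
    blocks = (new_head - prev_head) % list_blocks  # modulo handles wrap-around
    return blocks * COMMANDS_PER_BLOCK
```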

TABLE 8

Allowed Command List Transfers

Transfer Type    T    C    Dest. Code    Src. Code    Transfer Class
MSG XMIT         1    0    1 0           0 1          msg transfer
SMA READ         1    0    0 1           1 1          msg transfer
SMEM READ        0    0    0 1           1 1          non-msg trans
SMEM WRITE       0    0    1 1           0 1          non-msg trans
MEM MOVE         0    0    0 1           0 1          non-msg trans
CELL XMIT        1    1    1 0           0 1          msg transfer
CBIO READ        0    0    0 1           1 1          non-msg trans
CBIO WRITE       0    0    1 0           0 0          embedded-data trans
CBMEM WRITE      0    0    1 1           0 0          embedded-data trans
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TABLE 9 




Source Codes

Source Code    Meaning
0 0            Word 1 = DATA
0 1            IBus Memory Space
1 0            CB I/O Space
1 1            CB Memory Space


TABLE 10

Destination Codes

Destination Code    Meaning
0 0                 Illegal Code
0 1                 IBus Memory Space
1 0                 CB I/O Space
1 1                 CB Memory Space



Command List Priorities 

Two command list caches exist within the MPC. Servicing
priorities between the two lists varies: normal priority
(HTQ: high-throughput queue) and high priority (HRQ:
high-reliability queue).
Normal Priority Command List (software: HTQ)
The normal priority command list resides in SDRAM.
Thirty-two words from this list may be cached in SRAM in
the MPC ASIC normal priority command list buffer. Entries
written by software to this list receive the lowest priority
attention in regards to hardware processing. This list may
contain pointers to both message and non-message fragment
buffer entries as well as hold embedded-data transfer
instructions.
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High Priority Command List (software: HRQ) 
As with the normal priority command list, the high 
priority list also resides in SDRAM. Thirty-two words of 
this list may be cached in SRAM in the MPC ASIC high 
priority command list buffer. Entries written by software to 
this list receive a higher priority attention by hardware than 
entries on the normal priority list. This list may also contain
pointers to both message and non-message fragment buffer 
entries as well as hold embedded-data transfer instructions. 
Command List Entries

Command List Entry - Message Fragment Buffer
FIG. 19 defines the bits in a command list entry pointing
at data which will become part of a message transfer.
A description of the fields found in FIG. 42 follows:
The T in bit 31 of Word 0 stands for Type. If Type is set
to a one, the command list entry specifies a message transfer;
if Type is set to a zero, the command list entry specifies a
non-message transfer.

The C in bit 30 of Word 0 indicates to hardware that this
particular command list entry specifies a CELL XMIT
transfer. Hardware will know not to change the "Y" bits in
Word 2 but to copy them directly to the message address
field.

The D CODE[29:28] of Word 0 indicates to hardware
whether a transfer is destined for the bus or the ibus and
whether or not that transfer is in I/O space or memory space.
These bits refer to the address in Word 2, the destination
address.

The S CODE[27:26] of Word 0 indicates to hardware
whether the data transfer is sourced from the system bus or
the ibus and whether the address is in I/O space or memory
space. In the case of an embedded-data transfer, these two
bits will indicate that the data to be written is held in Word
1. These bits, then, refer to the address in Word 1, the Source
Address field.

F stands for First in bit 25 of Word 0. If the memory
location to which this command list entry points is the first
buffer in a series of buffers which will combine to form one
data transfer, then F will be set to a one. Otherwise, F will
be zero.

Likewise, the L in bit 24 of Word 0 stands for Last. If the
buffer to which this command list entry points is the last in
a series of buffers which combine to form one data transfer,
then L will be set to a one. Otherwise, L will be zero.

The V in bit 23 of Word 0 holds the valid bit. This bit
indicates that a command list entry requires hardware
processing (V=1 indicates processing needed; V=0 indicates
processing not needed). If a particular command list entry
shows a valid bit of V=0, hardware will assume that the
remaining command list entries in the same cell are also
invalid. Hardware will resume valid-bit checking at the
beginning of the next cell of command list entries.

The lower two bytes in Word 0 contain the number of
bytes of data in the buffer to which this command list entry
points.

Word 1 specifies the physical memory address where the
data buffer resides. This address may be either local
SDRAM or shared memory on the COX card.

The top 28 bits of Word 2 contain fields which are
bit-aligned to those in the message address field. The
hardware will append the bottom four bits to this 28-bit field,
thereby creating the message address for all transfers besides
the CELL XMIT. In the CELL XMIT case, whatever software
specifies in the command list entry will be directly copied into the
message address field. The individual fields in Word 2 are
described in detail with reference to FIG. 44.
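The Word 0 flag layout just described (T, C, D CODE, S CODE, F, L, V) can be sketched as a packing function, together with the rule that hardware supplies the bottom four bits of the message address. The function names are hypothetical, and placing the byte count in the lower two bytes of Word 0 is an assumption made for this sketch.

```python
def pack_word0(msg: int, cell: int, dest: int, src: int,
               first: int, last: int, valid: int, byte_count: int) -> int:
    """Pack the command list entry flags into a 32-bit Word 0."""
    return ((msg & 1) << 31        # T: 1 = message transfer
            | (cell & 1) << 30     # C: 1 = CELL XMIT
            | (dest & 3) << 28     # D CODE[29:28]
            | (src & 3) << 26      # S CODE[27:26]
            | (first & 1) << 25    # F: first buffer of a transfer
            | (last & 1) << 24     # L: last buffer of a transfer
            | (valid & 1) << 23    # V: entry needs processing
            | (byte_count & 0xFFFF))  # assumed location of the byte count


def message_address(word2: int, low4: int) -> int:
    """Combine the top 28 bits of Word 2 with four hardware-supplied bits."""
    return (word2 & 0xFFFFFFF0) | (low4 & 0xF)
```

With destination code 10 and source code 01, the top six bits of the packed word come out as 101001, the pattern listed for MSG XMIT.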
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Word 3 will not be processed by the MPC ASIC. 
Command List Entry - Non-Message Fragment Transmit
Buffer

FIG. 20 defines the bits in a command list entry pointing
at data which will become part of a non-message transfer.
The command list entry for a non-message data transfer
resembles that of a message transfer. Note that the Type bit
(Word 0, bit 31) will be set to zero for a non-message
transfer and Word 2 will be a physical memory location in
SDRAM or shared CEC memory. The other fields in FIG. 20
remain the same as those of FIG. 19.

Recall that an embedded-data transfer is really a type of
non-message transfer (meaning that the Type bit, bit 31 of
Word 0, is set to 0). An embedded-data transfer may be
distinguished from other types of non-message transfers by
decoding the S CODE bits, which will be set to 2'b00. With
this type of transfer, Word 1 will contain the data for writing
instead of a physical source address.
Command List Transfers
This section summarizes the types of transfers initiated by
command list entries as introduced with reference to FIG. 9
above. The information given below for each type of transfer
refers to fields found in the command list entry as described
above. Write and read are in relation to the bus, i.e., one
writes to the bus or one reads from the bus.
Message Transfers

The following transfers are referred to as message
transfers because their destination address is in message format
(Word 2 of command list entry). Address decoding maps bus
I/O space addresses 0x(8 or 9)XXXXXXX as message
addresses. The S CODE bits within the command list flags
indicate whether to retrieve the source data from the core bus
or from the I-Bus (see Table 16).

MSG XMIT -

A MSG XMIT transfer request on the command list asks for
the transfer of data from the SDRAM of the local IOP to the
SDRAM of another IOP. The command list entry points to
a message fragment transmit buffer.

Word0[31:26]=6'b101001
Source address (Word 1)=local SDRAM:
0x9XXXXXXX (I-Bus memory space)

Destination address (Word 2)=message address: 0x(8 or
9)XXXXXXX (system bus I/O space)

SMA READ -

This type of transfer moves data from shared memory on
the CEC to local SDRAM on the IOP. Data is massaged by
the MPC to resemble a MSG XMIT transfer, i.e., incoming
data is prepended with a message address field so hardware
will utilize the receive list for notifying software of data
entry.

Word0[31:26]=6'b100111
Source address (Word 1)=COX shared memory:
0xXXXXXXXX (system bus memory space; limited by 4
MB of addressable memory on COX)
Destination address (Word 2)=message address: 0x(8 or
9)XXXXXXX (system bus I/O space)

CELL XMIT -

A CELL XMIT data transfer is much like a MSG XMIT
except software has explicit control over the message
destination address and may only transmit up to sixteen words
per command list entry (one cell). This implies that
hardware will not alter the bottom four bits of Word 2 in the
message fragment buffer command list entry when placing
them into the message address field. This type of transfer is
used for diagnostic purposes only. Note that bit 30 of Word
0 in the command list entry will be set to C=1 as an
indication to hardware that the entry is a CELL XMIT entry.

Word0[31:26]=6'b111001

Source address (Word 1)=local SDRAM:
0x9XXXXXXX (I-Bus memory space)

Destination address (Word 2)=message address: 0x(8 or
9)XXXXXXX (system bus I/O space)

Non-Message Transfers 

The following transfers arc referred to as non-message 
transfers because the destination address of each command 
list entry refers to a physical location in either local SDRAM 
or COX shared memory. 

SMEM WRTTE — 

This transfer moves data from the SDRAM of the local 
IOP to shared memory on the COX. 

Word»[31^6}=6Vb001101 

Source address (Word l)=local SDRAM: 
0x9XXXXXXX (I-Bus memory space) 

Destination address (Word 2)=shared memory: 
OxXXXXXXXX (bus memory space; limited by 4 MB of 
addressable memory on COX) 

SMEM READ — 

The SMEM READ data transfer moves data from shared 
memory on the COX to local SDRAM on the IOR Data 
bypasses receive list mechanism in the MFC and is written 
directly to SDRAM. 

Word0[31;26]=6 , b000111 

Source address (Word l)=COX shared memory: 

OxXXXXXXXX (bus memory space; limited by 4 MB of 
addressable memory on COX) 

Destination address (Word 2)= local SDRAM: 
0x9XXXXXXX (I-Bus memory space) 

MEM MOVE— 

This type of transfer moves data out of and back into local 
SDRAM on the IOR Data transfer, therefore, bypasses the 
bus. 

Word0[3 1 MJstfbOOOlOl 

Source address (Word l)»local SDRAM: 
0x9XXXXXXX (I-Bus memory space) 

Destination address (Word 2)=local SDRAM: 
0x9XXXXXXX (I-Bus memory space) 

CBIO WRITE

This type of non-message transfer is termed an
embedded-data transfer since one word of data is written to
the bus memory space by placing this data in Word 1 of the
command list entry.

Word0[31:26]=6'b001000

Source address (Word 1)=data for writing (unrestricted)

Destination address (Word 2)=bus I/O space:
0xXXXXXXXX

CBMEM WRITE

This type of non-message transfer is termed an
embedded-data transfer since one word of data is written to
the bus memory space by placing this data in Word 1 of the
command list entry.

Word0[31:26]=6'b001100

Source address (Word 1)=data for writing

Destination address (Word 2)=COX shared memory:
0xXXXXXXXX (memory space; limited by 4 MB of
addressable memory on COX) 
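
By way of illustration only, the three-word command list entries described above might be assembled in software as sketched below. The sketch is in Python, which is not the language of the described embodiment. The names OPCODES, CommandListEntry, make_entry, transfer_type and c_bit are hypothetical. Only the transfer codes that are legible in this passage are included, and the low 26 bits of Word 0 are left at zero because their layout is not given here.

```python
from dataclasses import dataclass

# Transfer-type codes for Word 0, bits 31:26, as quoted in the text above.
OPCODES = {
    "CELL XMIT": 0b111001,
    "SMEM WRITE": 0b001101,
    "SMEM READ": 0b000111,
    "MEM MOVE": 0b000101,
    "CBIO WRITE": 0b001000,
}


@dataclass(frozen=True)
class CommandListEntry:
    word0: int  # transfer type in bits 31:26; remaining bits left at zero here
    word1: int  # source address, or the data word for embedded-data transfers
    word2: int  # destination address


def make_entry(transfer: str, word1: int, word2: int) -> CommandListEntry:
    """Build a three-word entry for one of the transfer types listed above."""
    for name, value in (("word1", word1), ("word2", word2)):
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name} must fit in 32 bits")
    return CommandListEntry(OPCODES[transfer] << 26, word1, word2)


def transfer_type(entry: CommandListEntry) -> str:
    """Recover the transfer type name from Word 0, bits 31:26."""
    code = (entry.word0 >> 26) & 0x3F
    for name, value in OPCODES.items():
        if value == code:
            return name
    raise ValueError(f"unknown transfer code {code:06b}")


def c_bit(entry: CommandListEntry) -> int:
    """Bit 30 of Word 0, which the text says is set for CELL XMIT entries."""
    return (entry.word0 >> 30) & 1
```

For example, make_entry("SMEM WRITE", 0x90001000, 0x00200000) yields an entry whose Word 0 carries 001101 in its top six bits, with a source in the I-Bus memory space and a destination inside the 4 MB of COX shared memory.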

The Free List

The MPC must place data entering an IOP into the
SDRAM. The software communicates to the hardware
locations in SDRAM where data may be placed. These locations
are called receive buffers. The free list consists of one-word
elements which point to the receive buffers. The length of
the receive buffers is fixed at N * 64 bytes, where N is one of
1, 2, . . . , 256. Each receive buffer is 64-byte aligned. The
specific length used is latched in a register called receive
buffer size.

Thirty-two 26-bit entries reside in the free list in the MPC,
arranged as two 16x26 bit, dual-ported SRAMs. Entry data
are cached from the full free list held in SDRAM.
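
As a minimal sketch of the sizing and alignment rules just stated, the helpers below compute a receive buffer length and convert between a 64-byte aligned SDRAM address and a 26-bit pointer. The function names are hypothetical, and it is an assumption that a 26-bit free list entry holds the upper 26 bits of a 32-bit address; the text states only that the entries are 26 bits wide and that the buffers are 64-byte aligned.

```python
CELL_BYTES = 64      # receive buffers are sized and aligned in 64-byte units
MAX_MULTIPLE = 256   # N ranges over 1..256


def receive_buffer_size(n: int) -> int:
    """Length in bytes of a receive buffer for a given multiple N."""
    if not 1 <= n <= MAX_MULTIPLE:
        raise ValueError("N must be between 1 and 256")
    return n * CELL_BYTES


def free_list_entry(address: int) -> int:
    """Assumed encoding: drop the six zero low bits of an aligned address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    if address % CELL_BYTES:
        raise ValueError("receive buffers must be 64-byte aligned")
    return address >> 6


def buffer_address(entry: int) -> int:
    """Inverse of free_list_entry under the same assumption."""
    if not 0 <= entry < (1 << 26):
        raise ValueError("entry must fit in 26 bits")
    return entry << 6
```

Under these rules the largest receive buffer is 256 * 64 = 16384 bytes.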

The Receive List 

Standard Receive Mode (RXE=1; PXE=0) 

After hardware finishes loading incoming data into
receive buffers as allocated by the free list, the data becomes
ready for processing by software. The hardware informs the
software that a receive buffer needs attention by placing a
pointer to that receive buffer, as well as other information,
onto one of two receive lists. One receive list indicates data
needing normal-priority attention and the other receive list
indicates data needing high-priority attention. As with the
command list and the free list, the entire receive list resides
in SDRAM. The MPC buffers receive-list data in four
dual-ported, 16x32 bit SRAMs. Two of these SRAMs are
dedicated to normal-priority entries and two are dedicated to
high-priority entries.

The following describes the entries shown in the receive 
list bit definition:

If the start bit (Word 0, bit 31) equals one, then the 
particular buffer pointed to by this receive list entry is 
the first buffer in a series of buffers which form a 
message. 

Likewise, if the end bit (Word 0, bit 30) equals one, then
the particular buffer pointed to by this receive list entry
is the last buffer in a series of buffers which form a
message. Note that this implies that if neither bit 31 nor
bit 30 is set to one, then the buffer pointed to by the
receive list entry is a middle buffer. If both bits 31 and
30 are set to one, then the message is one buffer in
length.

Bits 16 through 23 contain the count field indicating how 
many cells are stored in a particular receive buffer. 

Bits 10 through 15 determine the channel over which the 
IOP received the message. Each incoming message is 
granted a channel number unique during its transmis- 
sion time. 

Bits 6 through 9 relate to error checking. Bit 0 will be set
to one by hardware if any type of error occurs during
the transmission of the message. Bit 1, labeled seq,
equals one if the error which occurred during transmission
is a cell sequence error, i.e., cells were lost,
duplicated, or rearranged. Likewise, bit 2 corresponds
to a parity error and bit 3 is currently reserved for a
future error status indicator.

Word 1 points to the location in SDRAM corresponding
to the first byte of the receive buffer. Note that since all
receive buffers in SDRAM are 64-byte aligned, only 26
bits are required to specify the receive buffer address.
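
The standard-mode receive list entry described above can be decoded along the following lines. This is a sketch under stated assumptions rather than the embodiment's software. ReceiveEntry and decode_receive_entry are hypothetical names. The error flags are read from bits 0 through 2 as the individual bit descriptions state, which is an assumption given the bit range quoted for error checking. The buffer address is taken from the upper 26 bits of Word 1, which is likewise assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceiveEntry:
    start: bool          # Word 0, bit 31: first buffer of a message
    end: bool            # Word 0, bit 30: last buffer of a message
    cell_count: int      # Word 0, bits 23:16: cells stored in this buffer
    channel: int         # Word 0, bits 15:10: receive channel number
    error: bool          # any error during transmission (assumed bit 0)
    seq_error: bool      # cell sequence error (assumed bit 1)
    parity_error: bool   # parity error (assumed bit 2)
    buffer_address: int  # 64-byte aligned SDRAM address from Word 1

    @property
    def position(self) -> str:
        """Place of this buffer within its message, from the start/end bits."""
        if self.start and self.end:
            return "single"
        if self.start:
            return "first"
        if self.end:
            return "last"
        return "middle"


def decode_receive_entry(word0: int, word1: int) -> ReceiveEntry:
    """Split the two words of a receive list entry into named fields."""
    return ReceiveEntry(
        start=bool((word0 >> 31) & 1),
        end=bool((word0 >> 30) & 1),
        cell_count=(word0 >> 16) & 0xFF,
        channel=(word0 >> 10) & 0x3F,
        error=bool(word0 & 1),
        seq_error=bool((word0 >> 1) & 1),
        parity_error=bool((word0 >> 2) & 1),
        buffer_address=word1 & 0xFFFFFFC0,  # low six bits are zero by alignment
    )
```

An entry with both bit 31 and bit 30 set decodes to position "single", matching the one-buffer message case noted above.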

Promiscuous Receive Mode (RXE=X; PXE=1)

During promiscuous receive mode all bus cycles are
captured by the MPC. Via the receive list, hardware will
convey to software the bus address, the bus byte count, the
MEM/IO bit, an error bit, and the location of the receive
buffer in SDRAM.

The bits in this entry are defined as follows: 

Word 0 holds the address read off the bus during the 
address phase of the bus cycle. 

Word 1[31:6] holds the top twenty-six bits of the receive
buffer location in SDRAM where the data associated
with this bus cycle has been written. Note that receive
buffers are 64-byte aligned in SDRAM; therefore the
bottom six bits of the address are zero.

Word 1[5:3] indicates the byte count read off the bus.

Bit 2 of Word 1 is the memory bit from the bus indicating
whether data transfer is in either bus memory space
(mem=1) or bus I/O space (mem=0).
Bit 0 of Word 1 will be set to one if an error occurs during the bus cycle.

4.b. The Cell Address Field

Since the command list dictates the formation of cells traversing the bus, the address field associated with each cell is intimately related to information found on the command list. The address field on a cell destined for the bus varies with the type of data transfer. MSG XMIT, SMA READ, and CELL XMIT use message address fields. SMEM WRITE, SMEM READ, MEM MOVE, CBIO READ, CBIO WRITE, and CBMEM WRITE use non-message address fields.

The Message Address Field

FIG. 21 defines the bits found in the one-word message address field, as explained below.

Any cell which has the upper nibble of its header set to 8 (1000) or 9 (1001) will be identified as a cell which is part of a message.

The R in bit 28 indicates to which receive list the cell should be directed. A value of "1" indicates the high-priority receive list (HRQ) and a "0" indicates the low-priority receive list.

Bits 24 through 27 define the destination slot. The cell routes to this physical slot in the chassis.

Bits 20 through 23 indicate the bus over which the message will be routed.

Bits 16 through 19 define the source slot. The cell originates from this physical slot in the chassis.

Bits 12 through 15 indicate the bus over which the message originated.

Bits 10 and 11 show cell reliability. The cell reliability bits work against two watermark registers, implementing three levels of reliability for bus messaging as shown in Table 11.

TABLE 11

Bits 4 through 9 determine the channel over which an IOP is receiving a message. Each incoming message is granted a channel number unique during its transmission time.

A one in bit 3 indicates that this is the first cell in a message.

A one in bit 2 indicates that this is the last cell in a message.

Bits 0 and 1 allow room for a sequence number applied to each cell in a message. Cell sequencing takes place modulo 4.

The Non-Message Address Field

Cells which combine to form a non-message data transfer use physical memory locations in SDRAM or COX shared memory for their address fields.

III. INTERPROCESSOR MESSAGING SYSTEM (IMS)

The system description and message passing controller technique described above supports a very flexible and scalable architecture for the router, with the distributed protocol modules on intelligent I/O modules, centralized resources shared by all the I/O modules, and the ability to provide data link layer processing for basic input/output modules on the system. This ensures backward compatibility as well as flexibility and scalability for the system.

FIG. 22 provides highlights of major functional components of the software according to one embodiment of the present invention as distributed on the centralized processor COX and the intelligent input/output system IOS, which communicate by means of an interprocessor messaging system.

In FIG. 22, flow within each processor, either the COX or the IOS/IOP, can be considered vertical in the figure, while communication between the units is primarily horizontal and peer-to-peer. Thus, on the central networking resource COX, software for upper layer protocols is illustrated in block 700. The routed protocols are represented by block 701, with other network layer protocols 702 supported as necessary. Below the routed protocols 701 are the source routing resources 703, transparent bridging resources 704, and the SNL support 705.

The SNL is the sub-network layer which handles parsing of headers to determine the next layer of protocol, dispatching of packets to appropriate higher layer protocol handlers, [illegible] replace sub-network layer headers including MAC headers [illegible] protocols as PPP, FR, X.25, and SMDS.

Below the SNL support 705, transparent bridging 704 and source routing 703 are found the inbound receive demultiplexing resources 706. These resources direct the packets received from the lower layers into the appropriate upper layer modules. On the COX, the data link layer servers for the IOM input/output modules without remote intelligence are provided. Also, data link layer agents for the intelligent I/O modules are supported (block 707). Also, a link management function module LMF 708 provides queuing services for serial interfaces. The I/O drivers which support network events on the basic input/output modules, and the I/O driver agents which provide services to the I/O drivers on the intelligent input/output modules such as the IOS and IOP are also included on the centralized processor in block 709. A port and path manager PPM 710 is included, which handles mapping between logical ports and physical paths.

These modules communicate with resources distributed across the interprocessor messaging system IMS 715 to components located on the input/output modules. For the IOS or IOP modules with intelligent resources located on card, they communicate with the modules illustrated in FIG. 22. Thus, in the upper layer, distributed protocol modules 716 are found, which include transparent bridging, source routing and routed protocol support, and also pass through resources so that packets not supported locally can be passed through the IMS 715 to the centralized processor. A SNL remote driver 717 is also included on the IOS/IOP. The distributed protocol module 716 and the SNL remote driver 717 receive data through the inbound demultiplexer 718. The data link layer resources 719 which are executed on the remote devices supply the inbound receive demultiplexer 718. An outbound queue manager 720 is used for managing transfers out of the local card. I/O drivers 721 drive the input/output devices coupled to the IOS/IOP card. A port and path manager PPM 722 for the remote device is also included on the remote card.

The interprocessor messaging system (IMS) 715 provides a logical platform which allows communication between the central resource COX and a wide variety of remote resources across the common logical layer interface. Thus, the intelligence of the cards within the routing system can be varied and flexible as suits the need of particular installation.
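
As an aside on the cell format that underlies the IMS, the one-word message address field of FIG. 21 might be packed and unpacked as sketched below. The bit positions follow the field descriptions given earlier in this section. The names MessageAddress, pack, unpack, is_message_address and next_seq are hypothetical, and the sketch is offered as an illustration rather than as the software of the described embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageAddress:
    high_priority: bool  # R, bit 28: selects the high-priority receive list
    dest_slot: int       # bits 27:24
    dest_bus: int        # bits 23:20
    src_slot: int        # bits 19:16
    src_bus: int         # bits 15:12
    reliability: int     # bits 11:10
    channel: int         # bits 9:4
    first: bool          # bit 3: first cell in a message
    last: bool           # bit 2: last cell in a message
    seq: int             # bits 1:0: cell sequence number, modulo 4


# (field name, shift, width) for the multi-bit fields
_FIELDS = (
    ("dest_slot", 24, 4), ("dest_bus", 20, 4), ("src_slot", 16, 4),
    ("src_bus", 12, 4), ("reliability", 10, 2), ("channel", 4, 6),
    ("seq", 0, 2),
)


def is_message_address(word: int) -> bool:
    """A cell belongs to a message when its upper nibble is 8 or 9."""
    return ((word >> 28) & 0xF) in (0x8, 0x9)


def pack(addr: MessageAddress) -> int:
    word = (0x8 << 28) | (int(addr.high_priority) << 28)
    for name, shift, width in _FIELDS:
        value = getattr(addr, name)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
    word |= (int(addr.first) << 3) | (int(addr.last) << 2)
    return word


def unpack(word: int) -> MessageAddress:
    if not is_message_address(word):
        raise ValueError("upper nibble is not 8 or 9")
    values = {n: (word >> s) & ((1 << w) - 1) for n, s, w in _FIELDS}
    return MessageAddress(
        high_priority=bool((word >> 28) & 1),
        first=bool((word >> 3) & 1),
        last=bool((word >> 2) & 1),
        **values,
    )


def next_seq(seq: int) -> int:
    """Sequence number of the following cell; sequencing is modulo 4."""
    return (seq + 1) % 4
```

Because the R bit occupies bit 28, an upper nibble of 9 corresponds to a message cell bound for the high-priority receive list and an upper nibble of 8 to one bound for the other list.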
FIG. 23 breaks down the interprocessor messaging system into additional components which are centered across the core backplane bus represented by dotted line 750. A generic interprocessor communication service 751 for the central module, and a generic interprocessor communication service 752 for the remote module are provided. This service provides an interface to all other processor modules in the system. The generic IPC interfaces with one or more bus drivers 753 on the central side and one or more bus drivers 754 on the remote side. This way, communication between specific modules can be handled in the system. Also, the IPC interface 751/752 interfaces with one or more special services, such as the IMS logical layer 755 on the central side, and IMS logical layer 756 on the remote side. A debugging service 757 is found on the central side and 758 on the remote side. A board manager 759 on the central side provides centralized management of the remote modules.

The interprocessor messaging system logical layer module 755/756 is a significant part of the IPC services available. The IMS provides a message based interface between processor modules. An IMS subsystem on each processor module is composed of a logical layer that interfaces with client components, the physical layer that interfaces with external processor modules and a generic IPC layer between the two.

FIG. 24 illustrates data paths on a remote input/output module such as an IOS or IOP. In FIG. 24, the remote system includes a basic kernel module 800 and an interconnect manager 801. A monitor module 802 and a debug task 803 may be provided for system management. The system includes a plurality of network dependent drivers 805, a plurality of distributed protocol modules 806, and a messaging driver 807. Also, a network management agent 808 may be included. The network dependent drivers 805 include physical network drivers (IOP1) 810, data link layer drivers 811, an inbound receive demultiplexer 812, and an outbound transmit queue manager 813.

The distributed protocol modules include the basic Brouter distributed protocol module 814, a bridge DPM 815, and internet protocol (IP) distributed protocol module 816, and other DPMs 817 as suits the particular implementation. The distributed protocol modules are coupled with the messaging driver 807 which includes an outbound receive demultiplexer 820, and an inbound transmit queue manager 821. Core bus drivers 822 are also included, coupled with the outbound and inbound paths for driving one or more core busses to which the device is connected. The messaging driver 807 implements the IMS layer modules as discussed above under the control of the interconnect manager 801.

The interprocessor messaging system is specifically designed to meet the needs of control and data-in-transit traffic patterns in the scalable, flexible distributed router system according to the present invention. For each message type, based on the traffic pattern anticipated for the system, an IMS message queue for high throughput or high reliability and IMS drop priority are assigned. The table shown in FIG. 25 is a summary of the various IMS message types according to one embodiment of the invention, their service requirements and the quality of service assigned to them as a result. Note that the drop priorities and other parameters associated with these messages can be modified to suit the needs of a particular environment.

In FIG. 25, HRQ stands for high reliability queue, and HTQ stands for high throughput queue.

Thus, the IMS offers three types of transport services: (1) high throughput service using the HTQ, (2) high reliability, low latency service using the HRQ, and (3) guaranteed delivery service also using the HRQ. For a given queue, the IMS guarantees that packets will be delivered on the recipient processor module in the same order in which they were supplied to the IMS. Messages in the HRQ are given priority over messages in the HTQ for transmit as well as receive processing. However, the volume of traffic on the HRQ is supposed to be substantially smaller than that on the HTQ. Hence, messages on the HRQ are processed in small numbers and messages on the HTQ are processed in large batches for better throughput.

On the transmit side, the IMS provides quality of service registration based on transmit list fullness thresholds in software to ensure fairness and protection against overloading by any one message type. Each IMS message type is assigned a drop priority. A message of a certain priority will be dropped from being added to transmit list if the count of empty command list entries is below an eligibility threshold (or watermark) set for that message priority. In other words, the quality of service is a transmit side drop mechanism to assure fair queuing. A message with the highest drop priority (lowest reliability class) will have high threshold for free transmit list entries and hence the highest probability of being dropped. A message with a lower drop priority (higher reliability class) will have a lower threshold for free transmit list entries and hence the lowest probability of being dropped. Quality of service registration is not required for message types using "guaranteed" service, because the message will not be dropped if any free entries are available in the transmit list.

On the receive side, the IMS demultiplexes a large batch of IMS messages into smaller batches by IMS message type and has a receive function invoked for each message type received. The IMS is responsible for converting buffer data type messages into buffer data and data areas, and collecting segments and putting them together as a single chain of buffers and batching these buffer chains together by their message type. The IMS provides a variety of receive function registration services based on IMS message header type and IMS message type.

Each client provides a receive function that must be invoked for a specific message identification. When two clients register for the same message identification, with two different receive functions, the last registration takes effect. In order to ensure that no two clients assign the same values for two different message type symbols, all message type symbols must be centrally located in the header file in the IMS logical layer component. The reception of messages, whether on high throughput queue or high reliability queue, is transparent to clients. Registered receive function is invoked no matter which queue a message came in on. It is expected that a message is always sent on the same type of message queue.

The high throughput service and high reliability/low latency service are intended primarily for transport of buffer data, that is Buffer Data (BD) descriptors, and data pointed to by BD descriptors. The IMS message header type 0 is used to transport buffer data. Buffer data can be just a single buffer, a chain of buffers or a batch of chained buffers. IMS subsystem on the local processor will convert these buffers into messages and transfer the messages over to remote processors through the IMS. The data messages may be selectively dropped based on quality of service assigned to the message type. The IMS maintains statistics of messages transmitted, discarded, and failed.

Guaranteed message service is provided on top of high reliability, low latency IMS message service using the HRQ. Messages that could not be queued for sending will be
queued internally for retrying at a later time instead of dropping. IMS guarantees that data supplied to local IMS subsystems will be delivered by the recipient IMS in exactly the same order in which it was supplied and without replication. In one preferred implementation, the retry attempts are made at a minimum rate of every 100 milliseconds.

The IMS message type header 02 is used for transport of kernel messages and header type 64 is used for transport of frame driver type messages. However, header types used by the IMS are not limited to these and may grow as suits the needs of a particular installation.

FIGS. 26 through 29 illustrate the IMS message and header formats. Each IMS message shown in FIG. 26 has a header, generally 900, which includes an IMS header 901 and header data 902. The header data 902 includes, for example, portions of a buffer descriptor for a frame in transit which sets out status information about the frame. A pad 903 may be used to fill in an area between the beginning of the packet and a buffer data offset. Buffer data is carried in the region 904 and may be padded with a trailing pad 905.

In one system, the IMS may support three categories of messages as shown in FIGS. 27, 28, and 29. Each of these messages has an IMS header with fields indicating, at a minimum, the header type, the header length in words, the message length in bytes, and the buffer data offset in bytes. In FIG. 27, the BD message header format includes a trace bit 908, a header type field 909, and a header length field 910. The buffer data offset is stored in field 911. A message length is specified in field 912. A message type is specified in field 913. The last segment is unused.

The IMS kernel message header format shown in FIG. 28 begins with a trace field 915, includes the header type field 916, and a header length 917. The buffer data offset is stored in field 918. The message length is stored in field 919. The next word must be all zeroes, followed by a sequence number field 920 and a receive sequence number 921. The next field identifies the user message type 922, and the last field provides a remote mailbox identification 923 for kernel messages.

FIG. 29 illustrates the IMS frame driver message header format. Again, this format begins with a trace field 925 and includes the header type field 926 and the header length field 927. The buffer data offset is provided at field 928. The message length is provided in field 929. The message type is set out in field 930. The last two fields provide the send sequence number, field 931, and the receive sequence number, field 932.

FIG. 30 summarizes the interprocessor messaging system using the two types of queues for buffer descriptor type messages. Thus, on the centralized processor, or another intelligent processor, illustrated at block 1000, a high throughput queue htqtx 1001 and a high reliability queue hrqtx 1002 for transmitting commands are provided. Also, a high throughput receive list queue htqrx 1003 and a high reliability receive list queue hrqrx 1004 are included. The send buffer descriptor command from the logical layer system for the interprocessor messaging system stores a command in the appropriate list. The high throughput queue sends the IMS message to the high throughput receive list 1005 on the destination input/output module 1006. Also, high reliability commands are transferred to the high reliability queue receive list 1007 on the remote device. A similar path exists from the high reliability command list 1008 and the high throughput command list 1009 on the remote device 1006. These messages are transferred to the high reliability and high throughput receive lists on the central processor, or another input/output processor depending on the destination of the packet.

FIGS. 31 through 32 provide an example of the interprocessor messaging system logical layer processing for transfers from the central resource to a remote processor, and from the remote processor to the central resource respectively. Assuming that message buffers on the centralized resource are 512 bytes long and that buffers on the remote systems are 256 bytes long, the examples will operate as described. Also, in this example, a single packet per batch is used. The sample packet type is IMS data, the packet is 700 bytes long, and when transmitted through the interprocessor messaging system, a header of 8 bytes (assuming for this example that there is no header data (902 of FIG. 26)) is prepended without any additional padding to the message, thus the message size becomes 708 bytes. Thus, a message of size of 708 bytes is transferred over the high throughput queue from the central processor to the remote input/output module in FIG. 31, and from the remote input/output module to the central processor in FIG. 32.

Thus, a logical layer issues a command (e.g. 1020) to send to the buffer descriptor beginning with buffer descriptor BD-A, with an IMS data message type to a destination slot 00. Thus, the buffer descriptor BD-A is accessed and includes the fields as shown at block 1021. The first line in the buffer descriptor BD-A is a pointer to the next buffer descriptor, buffer descriptor BD-B, which includes the fields shown at block 1022. The 708 byte packet thus includes a buffer of length 512 bytes, and a buffer of length 188 bytes. The address for the buffer data is stored in the descriptors as shown.

For all the packets in a batch, the message header is prepended preceding the data buffer of the first segment at the desired data offset, and the address of the start of the IMS message header is set. Thus, the message type is IMS data, the message header size is 8 bytes, the data offset within the message is 8 bytes, and the message length is 708 bytes. Next, the logical layer determines the transmit list drop threshold, based on drop priority or quality of service of the IMS message type. Next, the algorithm determines which interprocessor controller transmit service to use, either the high throughput or high reliability queues. Finally, the appropriate interprocessor communication transmit function for the destination slot based on the transmit service required is invoked. In this example, the command for transferring IMS data to the high throughput queue is called for the destination slot beginning with buffer descriptor BD-A with a quality of service threshold specified. The IOS driver located on the source processor, that is the central processor in this example, executes the transfer using its high throughput command list when the header for the command list reaches the appropriate entry in the command list.

On the receive side, the logical layer demultiplexes a batch of receive messages into sub-batches by individual IMS message type. A client receive function is called for the batch of received messages beginning with the buffer descriptor of the first buffer for the batch. In this case, it is buffer descriptor BD-P. Thus, a first buffer in the receiving device is loaded with 256 bytes, the first 8 bytes of which are the header, which can be discarded. Thus, the buffer descriptor includes a pointer to the next buffer BD-Q, a buffer length field and a buffer data address with an 8 byte offset to discard the header at address P+8. A buffer descriptor BD-Q points to the next buffer descriptor BD-R, stores the full 256 bytes at address Q. Buffer descriptor BD-R indicates that it is the last buffer in the batch by a null next field, has the balance of the data in it, beginning at buffer data address






R. The demultiplexing occurs in response to (he high 
throughput queue receive list, when the header for that list 
reaches the appropriate entry. 

The IOS driver on the central processor adds entries to the 
transmit queue and updates the transmit tail pointer. Then it 
issues an event to the remote IOS which is to receive the data 
transfer. When the transmit head pointer is updated later on, 
the IOS driver frees up the transmit buffers from the last 
transmit head until the new transmit head pointer. On the 
receiving device, the central device driver queues up a DMA 
control block (DCB) which contains the source address, 
target address, length of the data to be copied, the data 
transfer type, and the status of the DMA transfer. The 
significant transfer bit is set in the DCB. The DCB is used 
to fetch the set up of the transfer from the central processor. 
When the DCB is complete, the transmit cache tail is 
updated to match the transinit tan pointer in the set up. Then 
one or more DCBs is queued up to copy newer entries in the 
transmit list to the end of the transmit cache list When the 
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data address is 8 bytes past the beginning of buffer BD-A as 
indicated in the figure. Buffer descriptor BD-A points to 
buffer descriptor BD-B which stores the balance of the data 
at the address at the beginning at point B. 

In the central device driver on the remote processor, a 
DCB is queued up with a bit set indicating a transfer Across 
the IMS to the central system. The shared memory set up is 
fetched from the central system using this DCB. When this 
is completed, the receive cache tail in shared memory, and 
the receive statu s cache tail in shared memory pointers in the 
receive manager are set to manage the receive buffer tail 
pointer which was retrieved from the central processor. A 
DCB is queued up to copy newer entries in the receive buffer 
list in the central processor to the end of the receive buffer 
cache list in the remote processor. When the list has been 
updated, the receive buffer cache tail and receive status 
cache tail are updated. Then a batch of transmit entries are 
processed to transfer into the receive buffers listed in the 
receive buffer cache. A DCB is queued up for each coctigu- 



transmit cache list in the central device driver on the remote 20 ous data transfer. For each receive buffer, when the last DCB 
processor is complete, the transmit cache tail pointer is using an address from that buffer is en queued, status for the 
updated. Next, a batch of transmit cache entries is processed buffer is set in the receive status cache, Next, the receive 
to transfer data into receive buffers. A DCB is men queued status cache entry at the head pointer is triggered, and the 
for each contiguous data transfer. For each receive buffer, next receive status cache entry is updated. Once the trigger 
when the last DCB using an address from that buffer is 25 DCB is completed, a DCB is queued up to copy the newer 



status cache entries to the central processor. Also the receive 
buffer cache pointers are updated to their trigger pointers, 
and the corresponding structures in shared memory are 
updated. 

IV. DISTRIBUTED PROTOCOL PROCESSING 
As mentioned above, the flexible architecture supported 
by the interprocessor messaging system, the high speed 
buses, and the variety of architectures which may be con- 
nected using this system support distributed protocol pro- 



enqueued, receive buffer and flag fields are sent to the 
receive list Then, the hansmit cache head pointer is updated 
to the next entry for processing. When the process 
completes, the transmit cache head pointer in the shared 
memory for the central processor is updated to match the 30 
transmit head in the cache on the local device. Next, a DCB 
is queued up to transmit the set up data from the IOS to 
shared memory, in the central processor. 

FIG. 32 illustrates the process in reverse from the remote
intelligent processor to the central processor. This system
receives an IMS send buffer data command at the logical
layer identifying the first buffer descriptor for the batch, the
message type, and the destination. Thus, for all packets in a
batch, the message header is prepended, preceding the data
buffer of the first segment at a desired data offset, and the
buffer data address at the start of the IMS message header.
This header indicates the message type as IMS data, and that
the message header size is 8 bytes, the data offset within the
message is 8 bytes, and the message length is 708 bytes.
Next, the logical layer determines the transmit list drop
threshold, based on drop priority or quality of service of the
IMS message type. Finally, the transmit service to use is
determined based on the message type, either high throughput
or high reliability. Finally, the appropriate IPC transmit
function is invoked for the destination slot based on the
required transmit service. This results in a command
indicating a high throughput transmit function indicating the
destination, the source buffer, and the quality of service
threshold. This message is added to the high throughput
command list as shown with a first entry for
BD-P, a second entry for buffer descriptor BD-Q, and a third
entry for buffer descriptor BD-R. On the receive side, the
receive buffers are loaded by the hardware, and the logical
layer demultiplexes a batch of received messages into
sub-batches by individual IMS message type. The client receive
function is invoked for each IMS message type received, and
executed when the receive list head reaches the appropriate
entry. Thus, the client receive function writes the incoming
batches to buffer descriptor BD-A, which indicates that the next
buffer descriptor is buffer descriptor BD-B, and the buffer data
length and the offset. Again, for a 512 byte buffer the first 8
bytes are header which may be discarded. Thus, the buffer

cessing. According to the present invention, the general
distributed protocol module (DPM) model operates with a
cache of recently accessed addresses maintained in each of
the intelligent input/output modules. The cache contains a
subset of the information contained in a routing table
maintained in the central processor. Packets received in the
DPM for destinations which are in the local cache are
forwarded directly without consulting the DPM server on
the central processor. Packets received for destinations
which are not in the cache result in a query from the DPM
to the central DPM server to determine an appropriate
destination.

The scalable high performance system according to the
present invention provides the interprocessor messaging
system interconnecting intelligent input/output modules
known as IOPs and IOSs in communication with the central
internetwork processor, and with other IOPs and IOSs, and
through the central processor to IOMs. Interprocessor
messages may be up to 64K bytes long according to one
embodiment. They convey data packets, state information,
and control information between data processing functions
or process instances on different cards in the system.
Messages to and from IOPs/IOSs and the central processor are
passed through data structures in the central processor's
shared memory. IOP/IOS to IOP/IOS messages are passed
directly from memory of one IOP to that in another IOP/IOS
message passing controller. Distributed protocol modules
according to the present invention are clients of the
interprocessor messaging system. FIGS. 32 and 33 are used to
describe two representative distributed protocol processing
systems relying on the interprocessor messaging system. It
will be understood that any logical layer processor messaging
system can be utilized for the distributed processing
model discussed. However, the high throughput and
flexibility according to the IMS described above is used in the
preferred system.
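As a rough illustration of the IMS behaviour summarised for FIG. 32, the following Python sketch chooses a transmit service from the message type and splits a received batch into sub-batches by message type, discarding the 8 byte header. It is only a sketch under stated assumptions: the message type names, the `SERVICE_BY_TYPE` table, and both function names are hypothetical and do not come from the specification.

```python
from collections import defaultdict
from enum import Enum


class Service(Enum):
    HIGH_THROUGHPUT = "high throughput"
    HIGH_RELIABILITY = "high reliability"


# Assumption: example message types and their service; the text only says the
# service is chosen from the message type.
SERVICE_BY_TYPE = {
    "IMS_DATA": Service.HIGH_THROUGHPUT,
    "IMS_CONTROL": Service.HIGH_RELIABILITY,
}

HEADER_SIZE = 8  # bytes, matching the example header size in the text


def select_service(msg_type):
    """Pick the transmit service for a message type (hypothetical table)."""
    return SERVICE_BY_TYPE[msg_type]


def demultiplex(batch):
    """Split (msg_type, buffer) pairs into per-type sub-batches.

    The leading header bytes of each buffer are dropped, so only the
    payload is handed to the per-type client receive function.
    """
    sub_batches = defaultdict(list)
    for msg_type, buffer in batch:
        sub_batches[msg_type].append(buffer[HEADER_SIZE:])
    return dict(sub_batches)
```

A client receive function would then be invoked once per key of the returned dictionary, which mirrors the per-type invocation described above.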

FIG. 33 illustrates a distributed internet protocol (IP)
process with a distributed protocol module, generally 1100,
on an IOS card, and a central IP process, generally 1101, on
the central processor. As mentioned above, an IMS driver
1102 on the remote card and an IMS driver 1103 on the
central card are utilized for communicating. Basic
components of the distributed protocol module 1100 include the
receive protocol dependent processing module 1104, a
protocol address cache 1105, and a packet disposition process
1106. The data coming in on the I/O ports 1107 for the
remote card are coupled to the receive protocol dependent
processor 1104. This processor utilizes the protocol address
cache 1105 to attempt to route the packet. The packet is
passed through the packet disposition process 1106 if
possible, for communication either back out the ports 1107
or across the IMS to the appropriate destination. If the
receive protocol dependent processor 1104 cannot route the
packet, then it uses the IMS service 1102 to request an
update to its protocol address cache 1105. The IMS module
1103 on the IOS routes a packet to the distributed protocol
module server 1108 in the central processor. Using protocol
address cache support services 1109, data to update the
protocol address cache 1105 of the remote processes is taken
from the complete protocol routing tables 1110 in the central
processor and forwarded to the cache 1105 across the IMS
system.

If the receive protocol dependent processing 1104
determines that it has received a type of packet which cannot be
routed using the distributed protocol service, then it is
passed through the IMS 1102, 1103 to the complete protocol
processing services 1111 in the central processor. These
services rely on the complete protocol routing tables 1110,
and forward the routed packet to packet disposition services
1112 which utilize the IMS to route the packets to the
appropriate destination if possible. If the destination is an IOM
module without IMS services, then local drivers for the IOM
ports 1113 are utilized by the packet disposition system 1112
to drive the packet out the appropriate port in the IOM
module. Packets incoming from the IOM module are
supplied to the complete protocol processing resources 1111. A
central DPM server which emulates the DPM interface for
the protocol routing processing 1111 can be utilized for the
IOM ports as illustrated at block 1114. Also, the central
processor can execute remote DPM configuration services
1115 utilizing the IMS messaging system through the DPM
server 1108 if desired.

Thus, a distributed protocol module provides protocol
specific processing on an intelligent I/O card. In the present
architecture, all packet processing is done in the centralized
code residing in the central processor, unless DPMs are
located on the receiving processor. With DPMs, protocol
processing is distributed to the intelligent I/O cards, which
rely on the central resource for many functions. The basic
goal is to perform packet forwarding computations for the
majority of packets on the intelligent I/O card, rather than
sending all the packets to a full function routing processor.
This can be achieved by implementing some or all of the
packet forwarding fast path on intelligent cards while
keeping the control functions including higher protocol layers
centralized on the central processor. Intelligent I/O cards
will maintain routing caches for a distributed protocol. The
DPM will try to make the switching decision locally. This
is done by looking up a destination address in the local
cache. In the case there is no cache entry, the packet
will be queued in a local memory, and a protocol cache
query (PCQ) is sent to the central processor. The code in the
central processor will reply with a protocol cache reply
(PCR) or may not respond at all. Based on the PCR, or the
lack of one, the DPM will either route the packet to the
destination port, send it to the central processor for routing
there, or discard it. In the case the DPM cannot process the
inbound packet because processing is required which is not
supported by the DPM, the packet is sent to the central
processor to be processed in the normal data path. For IP
routing in the scalable platform of the present invention the
following cases illustrate data flow.
Case 1: Unicast, known or unknown destination, received
   and sent on the same input/output module of the IOM
   style. In this case there are no distributed protocol
   modules involved. The packet is always sent to the
   central processor for processing, even if the destination
   port is on the same I/O card.
Case 2: A unicast packet, destination not in cache,
   received and sent on the same intelligent I/O module.
   The DPM on the intelligent I/O module will route the
   packet locally. A PCQ/PCR exchange will take place
   with the central processor to determine the route for the
   packet. The data packet will never cross the bus.
Case 3: A unicast packet, destination in cache, received
   and sent on the same intelligent I/O card. The packet
   contains information that the DPM cannot process (for
   example options in the header). The packet will be
   forwarded to the central processor for processing by the
   normal path.
Case 4: A unicast packet, destination in cache, from one
   intelligent I/O card to another intelligent I/O card. The
   distributed protocol module on the receiving intelligent
   I/O card will take the routing decision to route the
   packet to a remote intelligent I/O card. The IP code in
   the central processor is not involved.
Case 5: A unicast packet, known destination, from an I/O
   module to an intelligent I/O card. The IOM card
   receives the packet and sends it to the central processor.
   The routing decision is made in the central processor.
   The packet is sent to the intelligent I/O card for
   transmission. In this case, there is no distributed
   protocol module involved.
Case 6: A unicast packet, with the destination in the cache,
   is sent from an intelligent I/O card to an IOM card. The
   packet is received on the intelligent I/O card. The DPM
   makes a routing decision and sends the packet to the
   central processor. The protocol code on the central
   processor is not involved. The packet is placed directly
   in the output queue in the central processor for the
   destination port.
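The local switching decision exercised in Cases 2 through 4 (cache lookup, queuing on a miss, a PCQ to the central processor, and release or discard of the queued packets) can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the class and method names, the dictionary packet representation, and the `has_options` flag are all hypothetical, and timer handling is reduced to an explicit `on_timeout` call.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ROUTE = "route to destination port"
    TO_CENTRAL = "send to central processor"
    DISCARD = "discard"
    QUEUED = "queued awaiting PCR"


@dataclass
class CacheEntry:
    port: object = None          # filled in from the PCR
    to_central: bool = False     # PCR said: let the central processor route
    resolved: bool = False       # False while the PCQ is outstanding
    pending: list = field(default_factory=list)


class Dpm:
    """Hypothetical model of the DPM fast path on an intelligent I/O card."""

    def __init__(self, send_pcq):
        self.cache = {}
        self.send_pcq = send_pcq  # callable that sends a PCQ for a destination

    def receive(self, packet):
        # Packets the DPM cannot process (e.g. header options) take the normal path.
        if packet.get("has_options"):
            return Action.TO_CENTRAL
        dest = packet["dest"]
        entry = self.cache.get(dest)
        if entry is None:
            # Cache miss: create an entry, queue the packet, query the central processor.
            entry = self.cache[dest] = CacheEntry()
            entry.pending.append(packet)
            self.send_pcq(dest)
            return Action.QUEUED
        if not entry.resolved:
            entry.pending.append(packet)  # PCQ already outstanding
            return Action.QUEUED
        return Action.TO_CENTRAL if entry.to_central else Action.ROUTE

    def on_pcr(self, dest, port=None, to_central=False):
        """Apply a PCR and return the queued packets with their disposition."""
        entry = self.cache.get(dest)
        if entry is None:
            return []
        entry.port, entry.to_central, entry.resolved = port, to_central, True
        action = Action.TO_CENTRAL if to_central else Action.ROUTE
        released = [(p, action) for p in entry.pending]
        entry.pending.clear()
        return released

    def on_timeout(self, dest):
        """No PCR arrived in time: drop the entry and discard queued packets."""
        entry = self.cache.get(dest)
        if entry is None or entry.resolved:
            return []
        del self.cache[dest]
        return [(p, Action.DISCARD) for p in entry.pending]
```

In this model only one PCQ is sent per unresolved destination, and later packets for that destination simply join the pending list until the PCR or the timeout arrives.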
There are several logical components involved in
supporting routing when distributed protocol modules are
involved. In the intelligent I/O cards, there is an IP
distributed protocol module. This receives inbound packets from
the ports. It is responsible for routing the packets to the
destination or forwarding them to the central processor for
further processing. On the central processor there is an
internet protocol module which provides the IP normal path
and quick path, and routing table maintenance for all the
configuration functionality. The "normal" path, sometimes
called slow path, processes with more complexity. The
"quick" path, sometimes called fast path, is optimized for
the majority of packets which do not need complex routines
that handle several exceptions and special cases, like the
slow path. Also, since there is no DPM on the central
processor, a component is involved in the data path for
packets routed by the IP DPM in one intelligent I/O card to
a port on an IOM card. Further, a DPM server is located in the
central processor which serves as a central server for the
various DPMs in the system. Its primary responsibility is to
process cache related messages to and from the DPM.

The inbound receive protocol dependent processing
server 1104 in the DPM performs basic header validation
based on versions, check sums, links, et cetera. In one
embodiment, fast switching for the IP DPM will only apply
to IP version 4 packets with no options. All other packets
are forwarded to the central processor for the routing
decision. Of course the DPM may be enhanced to provide
services for other kinds of packets if desired.

The inbound receive protocol dependent processing also
determines whether local routing or a transfer to the central
processor must be made by the DPM. Also the inbound
receive protocol dependent processing does routing lookups
in the cache and does next hop determinations for the packet.
[...] the packet. If no entry is found, the packet will be queued
and a protocol cache query will be sent to the central
processor. Based on the response, or the lack of one, the
packet will either be forwarded to the central processor, routed
to a destination port, or discarded. Finally, the inbound
receive protocol dependent processing performs the full IP
filtering functionality. This involves maintaining multiple
copies of the IP filtering configuration, one on the central
[...] has an active IP DPM. Filtering will be applied in the DPM
[...] packet. That is, if the packet has to be sent to the central
[...] processing, then no filtering action is [...]

[...] updated in the header. The IP DPM can perform all
neces[...]sulation of the inbound packet is Ethernet and the
destina[...] DPM to format the header before the packet is
forwarded. For instance, one implementation of an IP DPM will
support conversion to the following MAC headers: any LAN type,
[...] For all other cases the packet is forwarded to the central
processor for processing. Of course, DPMs can support
additional MAC layer header conversions.

Packet disposition is handled as follows:
[...] one of the following ways.

1. Discarded
The incoming packet can be discarded. This can happen
for the following reasons:
   a. No response was received from the Protocol Cache
      Query for the requested route. This does not indicate
      that there is no route to the destination. It means that
      either the PCQ or PCR was dropped somewhere in the
      system.
   b. Filtering. The filtering database contained a matching
      condition with the action of discarding the packet.

2. Forwarded to central IP resources
Since the IP DPM contains a subset of the complete IP
routing functionality, the DPM may not be able to fully
process the packet. For such cases, the packet will be
forwarded to the central processor for processing. The
packet may be sent to the central processor for processing by
the IP normal path or the IP quick path.

B. The packet may be forwarded to the central processor
IP normal path for the following reasons.

1. Packet destination IP address is unicast and is the router
itself. These are end system packets and are usually handed
off to the upper layers. Packet is forwarded using the IMS
high reliability queue HRQ.

2. Packet destination IP address is broadcast/multicast.
These are usually network control packets. Packet is
forwarded using the IMS high throughput queue HRQ.

3. IP Security Options processing is enabled. Packet is
forwarded using the IMS HTQ.

4. Packet contains IP options. Packet is forwarded using
the IMS HTQ.

5. Packet requires fragmentation. Packet is forwarded
using the IMS HTQ.

6. Bridge Source Routing is enabled.

7. The PCR indicates that the destination interface is a
type which is currently not supported in the DPM routing
path. Packet is forwarded using the IMS HTQ.

8. The PCR indicates that no route exists to the destination
IP address. The packet must be sent to the central processor
for processing. The central processor IP needs to maintain
statistics and an ICMP message may need to be generated.
Packet is forwarded using the IMS HTQ.

9. The PCR indicates that the destination port is the
same port it was received on. Packet must be sent to the
central processor for processing. The central processor IP
needs to maintain statistics and an ICMP redirect message
may need to be generated. Packet is forwarded using the
IMS HTQ.

C. The packet may be forwarded to the central processor
IP quick path for the following reasons.

1. Debug controls indicate packets should [...]
the central processor. In this case all packets will be
forwarded to the central processor for processing by the IP
quick path. Packet is forwarded using the IMS HTQ.

2. The protocol address cache PAC has reached the
maximum limit of the entries it can have. An additional
PACE entry cannot be created. The packet is forwarded to
the central processor for processing by the quick path.

D. The packet may be routed to the destination interface
directly in the following cases.

1. If a valid entry for the destination is found [...]
a successful PCR which indicates that the packet should be
routed and the PCR contains the valid routing information,
the packet will be forwarded to the destination interface.

2. If the destination interface is on a remote slot, the
packet will be sent in an IMS HTQ message to the
destination slot peer DPM or the IP CDPM on the central
processor.

3. If the destination interface is [...]
the output queue for [...] port.

The Protocol Address Cache PAC 1145 is managed as
next described.

The DPM will maintain a local routing cache. Routing
will be performed by looking up the destination addresses in
the local cache. A query/response system will be used
between the DPM and the IP DPM server (DPMS) 1108
on the central processor to obtain new routing information
for insertion into the cache.

The PAC can be a state machine and be event driven.

Protocol Address Cache entries PACE are created as
follows.

A PACE will be created when a packet is to be routed and
no PACE entry exists for the destination IP address. In this
case a PACE is created, the packet is queued to the PACE
and a PCQ is sent to the DPMS to determine the route. All
subsequent packets for the same destination are queued until
a PCR is received.

If a PCR is received, the PACE is updated and based on
the PCR information the queued packets are either routed to
the destination port or forwarded to the central processor for
processing.

If no PCR is received and a set time expires, the PACE is
deleted and all queued packets are discarded.

If the PAC has reached the maximum entry limit, an
additional PACE will not be created. All packets which
require an additional PACE to be created will be forwarded
to the central processor for processing by the quick path.

Protocol Address Cache maintenance is handled as
follows.

The maintenance is purely timer based. A PACE will
become stale after a period of time and will need to be
refreshed by issuing a PCQ to the DPMS.

A PACE can be deleted if the age timer expires indicating
that the entry is no longer valid.

The complete PAC can be flushed via command, or by the
DPMS when the routing table or the address table in the
central processor is flushed.

Cache maintenance will be the core of the DPM
functionality. It will include:

1. queuing of packets for destinations not in cache.

2. dispatching of Protocol Cache Queries to the IP DPMS
on the central processor for unknown routes.

3. updating of local cache entries from Protocol Cache
Reply messages from the IP DPMS on the central processor.

4. supporting a rate limiting mechanism for PCQs.

5. cache maintenance including aging, refresh, re-use.

6. forwarding or dropping of queued packets based on
successful response or aging.

7. supporting general controls including enable, disable,
flush, display.

Per-port statistics will be maintained. The following
counts will be maintained:

1. packets received from network.

2. packets discarded.

3. packets routed to port on local I/O card.

4. packets routed to remote I/O card.

5. packets sent to CEC for exception processing.

Additional counts of the cache and dataflow statistics are
available under debug. These statistics will be available to
the central processor for display via the user interface.

The IP DPM will be supported on the central processor by
the following logical components: IP CEC, IP DPMS and
IP CDPM.

IP CEC

This component is not directly involved in the data path
of packets routed by the IP DPM. This component provides:

1. complete IP protocol processing. This includes
maintaining the IP routing tables, processing configuration
information, et cetera.

2. IP "quick" path packet switching for packets received
from I/O cards which do not have a DPM, or from intelligent
I/O cards where the DPM has been disabled.

3. IP "normal" path processing for any packet received
from any interface. This includes processing of exception
packets from a DPM.

IP DPM Server (IP DPMS)

This component provides IP DPM cache support. The IP
CEC will maintain the master IP routing tables. The DPMs
on the I/O cards maintain local Protocol Address Caches.
The IP DPMS will be responsible for

1. receiving Protocol Cache Queries from a DPM and
responding to them with the appropriate Protocol Cache
Reply.

2. flushing DPM Protocol Address Caches when
appropriate. This will be done when large scale changes are made
to the central IP routing databases. For example, when the
routing table is flushed, a port comes up or goes down, et
cetera.

The IP CEC also provides IP DPM receive function
registration. The IP DPMS will indicate the set of unicast/
broadcast functions that should be used by the IBD to
forward packets to the IP DPM. This will be done through
the sid_register_d>m_.tag function call.

The IP CEC is also used for IP configuration. The central
processor IP maintains the master user configuration for the
IP protocol. IP DPMS is responsible for communicating the
appropriate configuration to the DPMs. Configuration
information is distributed to the DPM on notification that a DPM
needs initializing (via the receipt of the ICM-PROCESS_VP
message), or whenever the relevant configuration
changes.

The configuration information that needs to be
communicated includes:

1. status of the IP routing functionality: whether the DPM
   should process packets locally or forward all packets to
   the central processor for forwarding. In certain
   configurations (for example when IP security is enabled)
   all packets must be forwarded to the central processor
   for processing.

2. keeping the IP filtering databases in synchronization. A
   copy of the IP filtering database will be maintained on
   the central processor and the I/O cards by the DPM. At
   initialization, and whenever there is a change in the
   filtering database, the configuration will be downloaded
   to the DPMs.

The IP CDPM component is involved in the data path of
packets switched by a DPM. The packet flow model for
routing with DPMs advocates that the data flow should be
from a DPM to a peer DPM. That is, a routed packet should
be sent from the DPM on the card it was received to the
DPM on the destination slot. It is the responsibility of the
DPM on the destination slot to place the packet on the
appropriate output queue.

Since the central processor has no DPM component, the
IP CDPM provides the subset of the DPM functionality
necessary for receiving and forwarding routed packets.

IP CDPM will register the necessary functions to receive
packets routed by a DPM for transmission to an IOM port.
This function will gather statistical information before
sending the packets to be placed on the outbound transmit
queues.

The IP specific messages used in the distributed
processing include the following:

1. PCQ message to look up a route in the central routing
table. This is transferred in the high reliability queue with a
priority of one from the distributed protocol module to the
distributed protocol server in the central system.

2. A PCR message in response to a PCQ. This can contain
the valid next hop information or an indication to send the
data packets to the central processor. This message is
transferred on the high reliability queue with a priority of
one from the DPM server in the central processor to the
remote IP DPM.

3. Packets routed by a DPM to a remote intelligent card
or to an IOM card without intelligence. This packet is
transferred in the high throughput queue with a standard
priority, either from an IP distributed protocol module to a
second IP distributed protocol module in the destination
intelligent I/O card, or from the IP distributed protocol
module to the IP central distributed protocol module on the
central processor for routing to the IOM card.

4. Exception control packets (broadcast or unicast), or
packets whose destination IP address is the local address, are
sent to the central processor for processing in the IP normal
path. These packets are sent by the high reliability queue with
standard priority from the distributed protocol module to the
central routing resources.

5. Exception data (unicast) packets sent to the central
processor for processing by the IP normal path. These are
transferred in the high throughput queue with standard
priority from the distributed protocol module to the central
processors.

6. Packets sent to the central processor for processing by
the IP fast path, used for debugging purposes or when the
local protocol address cache has reached the maximum
number of entries. These packets are sent in the high throughput
queue with standard priority from the IP distributed protocol
module to the central IP resources.

7. Packets used to convey commands from the central unit
are transferred to the kernel with guaranteed delivery from
the distributed protocol module server to the distributed
protocol module.

8. Configuration information is downloaded to the DPMs
from the central processor with guaranteed delivery from the
distributed protocol module server to the distributed
protocol module.

9. DPM control messages with guaranteed service, for enable,
disable, and flush commands, are sent with guaranteed
priority from the server to the distributed protocol modules.
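The queue and priority assignments in items 1 through 6 of the list above can be collected into a single lookup table. The Python sketch below does only that; the message kind names, the `Delivery` tuple, and `delivery_for` are hypothetical labels chosen for illustration, while the queue and priority values are taken from the list. Items 7 through 9 are described only as using guaranteed delivery, so they are kept in a separate set rather than assigned a queue.

```python
from typing import NamedTuple


class Delivery(NamedTuple):
    queue: str      # "high reliability" or "high throughput"
    priority: str   # "one" or "standard", as worded in the list


# Items 1-6 of the IP message list (kind names are illustrative only).
IP_MESSAGE_DELIVERY = {
    "PCQ": Delivery("high reliability", "one"),
    "PCR": Delivery("high reliability", "one"),
    "ROUTED_PACKET": Delivery("high throughput", "standard"),
    "EXCEPTION_CONTROL": Delivery("high reliability", "standard"),
    "EXCEPTION_DATA": Delivery("high throughput", "standard"),
    "QUICK_PATH_PACKET": Delivery("high throughput", "standard"),
}

# Items 7-9: sent from the DPM server to the DPM with guaranteed delivery.
GUARANTEED_KINDS = {"COMMAND", "CONFIGURATION", "DPM_CONTROL"}


def delivery_for(kind):
    """Return the Delivery for a queued message kind, or None if guaranteed."""
    if kind in GUARANTEED_KINDS:
        return None
    return IP_MESSAGE_DELIVERY[kind]
```

Keeping the mapping in data rather than in branching code makes it easy to see that only the cache query and reply use the elevated priority.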

FIG. 34 illustrates the components of a distributed
transparent bridging TB process using the distributed protocol
module system according to the present invention. Thus, a
distributed protocol module for the bridge process includes
the components illustrated at generally 1300. The central
processor includes the bridge process 1301 shown in the
figure. Thus, the distributed protocol module includes the
receive bridging dependent processing 1302 which includes
a source address learning resource 1303, a bridge address
cache 1304, and spanning tree support 1305. The bridging
dependent processing 1302 communicates with the packet
disposition module 1306, and with the IMS resource on the
local card 1307. Packets are received on the I/O ports 1308 of
the processor executing the distributed protocol module.
Also, the packet disposition resources 1306
can route packets directly back to the I/O ports.

The central bridge process includes a transparent bridging
DPM server module 1310 which includes a remote DPM
configuration service 1311, and bridge cache support 1312.
The complete bridge processing resources 1313 are included
in the central processor. Also, the central bridge routing
tables 1314 are maintained here.

The complete bridge processing resources 1313
communicate with the packet disposition services 1315 and with the
IOM ports 1317 served by the central processor. Also a
CDPM resource 1318 on the central processor is utilized for
facilitating the interface between the DPMs and the IOM
ports 1317.

In a simpler version of the Bridge DPM, most transparent
bridging features are implemented directly by the Bridge
DPM running on the intelligent I/O (IIO) cards (such as an
IOP or IOS). However, some features are not directly
implemented by the Bridge DPMs. Bridged data packets are
forwarded to the central processor for processing as
exception packets when a required feature is not directly
implemented within the Bridge DPM. The transparent bridging
features not implemented within the Bridge DPM as
described below can be migrated from the central processor
to the bridge DPM on the IIO card.

The following transparent bridging features are
implemented directly within the Bridge DPM:

1. dynamic learning of MAC addresses.

2. aging of inactive MAC addresses.

3. firewall support.

4. broadcast limiting support.

The following transparent bridging features are not
implemented directly within the Bridge DPM in this version and
require exception processing by the central processor:

1. translation bridging.

2. mnemonic filtering.

3. source and destination address-based security.

The Bridge DPM Cache 1304 is maintained on the I/O
cards. The Bridge DPM on the IIO maintains a cache of
recently encountered MAC addresses which is referred to as
the Bridge DPM Protocol Address information Cache
(PAC). Each PAC entry (PACE) contains information related
to a single MAC address. The PAC is used to determine how
inbound packets are processed by looking up the destination
MAC address in the PAC. The PAC is also accessed during the
source address SA learning and refresh process.
PAC entries are created when:

1. An inbound packet is received for a unicast destination
   address not currently in the PAC. This results in a query
   from the Bridge DPM to the Bridge DPM Server to
   determine the appropriate destination port. The Bridge
   DPM issues a PCQ message and the Bridge DPM
   Server responds with a PCR message. A FIFO queue is
   maintained for the PACE to queue the original packet
   and any additional packets (up to a limit) received for
   the same destination while waiting for the PCR.

2. The Bridge DPM Server distributes local MAC
   addresses to the Bridge DPM by issuing CONFIG
   messages. Local MAC addresses are distributed during
   Bridge DPM initialization and updated whenever a
   local MAC address is added or deleted during
   operation. PAC entries for local addresses are not subject to
   aging and remain in the PAC until explicitly deleted.

3. An inbound packet is received with a SA not already in
   the PAC and a PCU is posted to the central engine
   controller CEC.

PAC entries are deleted when:

1. An inactive address is aged out by the Bridge DPM
   maintenance function.

2. The Bridge DPM Server issues a CONFIG message
   when a local address is deleted.

3. The Bridge DPM Server issues a CONFIG message to
   flush entries in the PAC in response to a user request or
   as a result of topology change detected by STP.

Entries in the PAC are refreshed by periodically issuing
queries (PCQs) to the central bridge routing table on the
CEC. A PACE is refreshed (updated) when a PCR is posted
to the Bridge DPM by the Bridge DPM Server on the CEC
in response to a PCQ.

Source Address SA learning and refresh are handled in the
system as follows.

Source address learning occurs in the Bridge DPM when
new SAs are encountered in received inbound packets.
Refresh is used to prevent previously learned active (i.e.
transmitting) stations from aging out of the Bridge DPM
cache and the central bridge routing table on the central
processor. Source address learning and refresh is not applied
to outbound packets.
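The creation and deletion rules for Bridge DPM PAC entries listed above can be illustrated with a small Python model. It is a sketch under assumptions, not the described implementation: the class and method names are hypothetical, the pending limit of 4 stands in for the unspecified "up to a limit", and aging is modelled with a simple last-seen timestamp and a caller-supplied maximum age.

```python
from collections import deque
from dataclasses import dataclass, field

PENDING_LIMIT = 4  # assumption: the text says only "up to a limit"


@dataclass
class Pace:
    port: object = None
    local: bool = False        # local MAC addresses are exempt from aging
    last_seen: float = 0.0
    pending: deque = field(default_factory=deque)  # FIFO awaiting the PCR


class BridgePac:
    """Hypothetical model of the Bridge DPM protocol address cache."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.entries = {}

    def add_local(self, mac):
        # Created from a CONFIG message; stays until explicitly deleted.
        self.entries[mac] = Pace(local=True)

    def delete_local(self, mac):
        # Deleted by a CONFIG message when the local address is removed.
        self.entries.pop(mac, None)

    def queue_unknown(self, mac, packet, now):
        """Queue a packet for an unknown destination.

        Returns True when a new entry was created, i.e. when the caller
        should issue a PCQ to the Bridge DPM Server.
        """
        pace = self.entries.get(mac)
        created = pace is None
        if created:
            pace = self.entries[mac] = Pace(last_seen=now)
        if len(pace.pending) < PENDING_LIMIT:
            pace.pending.append(packet)
        return created

    def age_out(self, now):
        """Remove inactive non-local entries; return the removed addresses."""
        stale = [mac for mac, pace in self.entries.items()
                 if not pace.local and now - pace.last_seen > self.max_age]
        for mac in stale:
            del self.entries[mac]
        return stale
```

A maintenance function driven by a timer would call `age_out` periodically, which matches the role the text gives to the Bridge DPM maintenance function.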
SA learning and refresh for inbound packets is performed
by looking up the SA in the PAC. SA refresh is performed
if an entry already exists in the PAC for the SA. It involves
updating the age flag in the PACE for the SA.

SA learning is performed if no PAC entry exists for the
SA. It involves sending a PCU message from the Bridge
DPM to the central processor indicating the location (port)
of the SA. A PCU message is not generated for each new SA
encountered, [...]
buffer. The PCU is sent to the central processor
asynchronously by a maintenance function when a timer expires. This
allows batches of SAs to be placed in a single PCU message.

A check is also made to determine if the packet was
tagged as BDPM_LEARN_SA (in a driver level filter
function described below). If so, the packet is discarded
since it is a local traffic packet that was forwarded to the
Bridge DPM solely for the purpose of SA learning.

The protocol address cache maintains entries in three
states, including a fresh state, a stale state, and a time out
state. A timer based protocol cache management system is
implemented, in which a fresh entry in the cache remains
fresh as long as it is being used. If it is not used for an
interval, such as 4 or 5 seconds, then it transitions to the stale
state. If an entry in the stale state remains unused for 20
seconds, it is marked invalid. If an access
occurs which relies on a stale cache entry, the access will
utilize the stale entry, and then the cache management
system will forward a request to the central processor to
refresh the entry. If the refresh is received, then the cache
entry is moved back to the fresh state. The time intervals
utilized for a given protocol address cache vary depending
on the traffic in a particular system having the cache. Thus,
for some protocol types, longer or shorter intervals may be
utilized for the transition from fresh to stale, or from stale to
invalid. These intervals should be optimized for a particular
implementation.

A rate limiting mechanism is used to limit the rate at

This value is used later by the DLL demux to determine how
received packets should be processed and in some cases it
contains the destination port number for packets to be
bridged. The appropriate return value is determined by
looking up the DA in the Bridge DPM PAC.
[...]
for extended periods. Bridge age refreshing is performed by
setting the "active" bit in the PACE for the SA and incrementing
a per-port filtered packet counter.

SA learning and bridge age refreshing for packets which
are forwarded by the driver are not performed within
bdpm_dvr_filter. Instead, it is deferred until the packets are
received by the Bridge DPM. This approach requires that
some local traffic packets are "leaked" (forwarded) to the
Bridge DPM solely for the purpose of SA learning. These are
the local traffic packets containing an SA which needs to be
learned (no PACE exists for the SA).

The rationale for deferring SA learning until packets reach
the Bridge DPM is:

1. It results in only learning SAs from bridged packets and
not from routed packets.

2. Optimum performance is not required while learning
new SAs so the additional overhead in leaking some local
traffic for SA learning is insignificant.

3. Optimizations can be realized by performing learning
and bridge age refresh based on batches of packets.

4. Overhead incurred within the bdpm_dvr_filter
function should be minimized since it runs in an interrupt
context (SA learning requires creating new PAC entries and
sending PCU (Protocol Cache Update) messages to the
central processor).

All bdpm_dvr_filter function return values are listed
below:

1. BDPM_FILTER
Local traffic packet to be filtered by BEMD.

which PCU messages are posted to the central processor. 2. BDPM_JjOCAL__HOST 

Th«» mtr> H mi ring m<v4mntcm in tn pnwmt ihr. central 40 Unicast packets with a DA containing one of the bridge/ 

processor from being flooded oat with PCUs from the router's local AMC addresses (packets addressed to the 

Bridge DPMs following some LAN topology transition. bridge/router itself). These are either packets to be routed or 

A PACE is created for each new SA encountered. The end system packets directed to the bridge/router. 

PACE state is set to indicate that a PCU has been posted to 3 . <hridge_dest^port> 

the central processor. This state is used to prevent additional 45 <bridge_d^st_4>art> is returned far paxkets to be bridged 

PCUs for the same address from being posted to the central when the DALs in the PAC It contains the actual destination 

processor. The PACE is not used to forward data packets port somber the packet should be forwarded to. This allows 

until a PCQ/PCR exchange between the WO and central the Bridge DPM to forward the packet without looking up 

processor completes. New SA PACEs are not created while the destination in the PAC 

PCU generation is inhibited by the PCU rate limiting so 4. BDPM^MUITICAST 

Twham-CTn AH packets with the multicast address bit set in the DA. 

A driver-level packet fitter function is implemented as This includes both multicast and broadcast packets, 

follows. 5. BDFM_UNKNOWN_J)A 

The Bridge DPM driver-level packet filter function, Unicast packets containing a DA which is not in the PAC 

bdpm_dr\T_fiiter, is called by the WO interface driver for 55 or a DA which is in the PAC but does not have a valid port 

each received packet. The bdpm^drvr^lter function wul associated with it (occurs while waiting for a response to 

be called by the Ethernet (or other) LAN Driver; query to locate the DA). The SA may or may not be in the 

The bdpm_dvr__filter function is invoked by BEMD for PAC These are packets to be bridged but the destination is 

each received packet as soon as the packet's destination not yet known, 

address DA and SA have been ready from the controller 60 6. BDPM__LEARN_SA 

receive FIFO. Its primary purpose is to perform driver-level Local traffic packets where no PACE exists for the SA or 

bridge filtering by returning a filter or forward indication to a PACE exists for the SA but the source port in the PACE 

driver. This allows the driver to filter local traffic packets by doesn't match the source port the packet was received on 

issuing a flush command to the controller without reading (occurs when station moves from one port to another). These 

the entire packet from the FIFO. 65 packets are passed to the bridge DPM solely for me purpose 

For packets to be forwarded to DLL Demux, a value is of SA learning and they are discarded after the SA has been 

returned which is used by driver to tag the received packet learned. 
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7. BDPM EXCEFnON 3. No PCR Received 

Unicast packets where a PACE with a valid port exists by This occurs if either the PCQ or PGR is lost The PACE 

the destination media type differs from the source media entry will time-out and the packers) queued to PACE arc 

type. These packets require translation (eg. Ethernet to discarded. The PACE entry is also deleted so that additional 

FDDI) and are forwarded to the central processor for pro- 5 packets for the same DA will result in another PCU. 

cessing as exception packets. It is also returned if the Normal receive path multicast data packet handling is 

destination is an IOM serial port, and the WAN protocol is implemented as follows, 

SMDS or X.25. Inbound multicast data packets in the normal receive data 

The BDPM inbound receiver is implemented as follows. path are received by the bdpm_ibd_rcv_mcast function. 

The BDPM inbound receiver is invoked by inbound 10 Packet floodm 8 is applied to all inbound multicast and 

demultiplexer IBD to receive batches of packets from the unkEOWa packets received in the normal receive 

local WO ports. A separate receive function is used for Hoocung ^ r^onned as follows: 

unicast and multicast batches. L A copy of 1S f <™arded to each of the local 

Three different pairs of multicast and unicast receive „ ™**f <"*^ «* » ™ 

functions are supported* 13 single copy of the packet is forwarded to central 

i «j M rZt ' t * ^ , * ^ ^ processor TB, which floods the packet to all of the active 

^^^™-^™[^**^™^^^ IOM ports, performing transladoTas required. 

^^Z C ^° nS ^ ^ ^ *** 3.A^e^ofLpacketisforwS 
accessing not active. other Bridge DPMs on the other WO cards in the system 

2. The appropriate iirdeast/nuMcast receive function pair 20 without performing translation. The receiving Bridge Dm 
is selected when the BDPM Server calls the snl_ then floods the packet to all the local WO ports. 
regtaerdpm_jag function on the central processor. The BDPM outbound receiver is iinplemcDted as follows: 

3. When the debug or filter versions of the receive Outbound bridged data packets are received from the 
functions are active, all packets received by the unicast and Corebus via IMS and forwarded to the local WO ports, 
multicast receive functions are passed to the central proces- 25 Source address learning is never applied to outbound pack- 
sot without being processed by the Bridge DPM. The ets received by the Bridge DPM. The Bridge DPM directly 
filtering receive path is used whenever rnnemonic or source receives the following packet types at the bdpm__ 
and destination address bridge security filtering is enabled. outbound_jrv function: 

It forces an packets received by the Bridge DPM to be 1. multicast packets from another H/O or the central 
processed by the central processor as exception packets 30 processor. 

since tillering is not supported within the WO card The 2. unknown unicast packets from another WO or the 
debug receive path is only used to facilitate debugging. It central processor 

allows the Bridge DPM to essentially be disabled. Each packet received at bdpm^outbound _rcv is flooded 

The normal inbound receive path is used when filtering to all local HAD ports except the one it was criginal received 
and debug mode are inactive. It is described in detail below. 35 on by passing multiple copies of the packet to the common 
Normal receive path unicast packet handling for the IBD transmit function iM_dpni2port_janit The IBD trans- 
system is done as follows, mit routine updates the statistics and queues the packets for 

Unicast packets in the normal receive path are received by transmission on the WO local ports, 
the bapm_^_Jbd_unicast function, Disposition of uni- The following packet types bypass die Bridge DPM and 
cast packets is based on the packet tag value applied by the 40 are passed directly to the IBD function ibd_cec2port_janit 
Bridge DPM driver-level packet filter function: where they are queued for transmission without updating 

1. <bridge_dest_j>ort>--4)ra statistic (statistics are counted on the CEC): 
filiation port indicated. 1. known unicast packets from IOM ports. 

2. BDPWLJSXCEPnON— Forwarded to the central pro- 2. unicast exception packets. 

cesser ^for processing. 45 Known unicast packets from other WOs in the system 

3. BDPM_LEARN_SA — Source address learning is bypass the Bridge DPM and are received directly by the 
performed and then the packet is discarded. common IBD receive function tod_Jms2pat_junit where 

4. BDPM_UNKNOWN_X>A — A PACE entry is created statistics are updated and the packets are queued for trans- 
fer the DA and the packet is queued to the PACE, a PCQ is mission to the WO ports. 

posted to the central processor. Additional packets received 50 Spanning Tree Protocol support is implemented as fol- 
for die same DA while waiting for the PCR from the central lows. 

processor in response to the PCQ are also queued to the Spanning Tree Protocol (STP) processing is not per- 
PA ^L^ ^ M , . , fanned by the Bridge DPM or Bridge DPM Server on the 

Packet disposition is based on the PCR returned as central processor. Instead, aU STP processing is performed 
follows: 55 by the existing STP component on the central processor. The 

L Destination Found Bridge DPM Server monitors the STP port state (eg. 

Packet is forwarded to the destination port if translation is LISTENING, LEARNING, FORWARDING, BLOCKING) 
not required otherwise it is forwarded to the central proces- of the WO local ports and posts CONFIG messages to the 
sor for processing. The PACE is updated so additional bridge DPMs to inform them of me current port state, 
packets received from the same DA will be tagged as 60 STPBPDUs received on WO local ports are passed by 
<bridge_dest_part> or BDPM_JEXCEFTION by the IBD to the Default DPM and then forwarded on the STP 
^f^vel filter function. component on the central processor bypassing the Bridge 

2. Destination Not Found DPM and Bridge DPM Server. 

Packet is an unknown unicast and is flooded using the Bridge DPM server functionality includes the following, 
same technique used far multicast packets described below. 65 The primary purpose of Bridge DPM Server is to provide 
The PACE entry is also deleted so that additional packets for centralized support on the central processor for the Bridge 
the same DA will result in another PCU. DPMs running on the WO cards. It is responsible fan 
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1. Distributing configuration and status information to the not subject to aging from the PAC and are only removed 
bridge DPMs. from the PAC when explicitly deleted by a CONFIG mes- 

2. Receiving PCQ messages from the Bridge DPMs and fr° m me Brid 8 c DPM ^ 

responding within PCRs. Bridge filter state information is distributed as follows. 

* « . . i- ^ ™ .j nm . * 5 The current bridge filtering state information for both 

3. Receiving PCU usages from t the Bridge DFNfc to filtering and source and destination address 
lea^fr^hsource addresses of stations attached to ^ sccu ^teri ng is eyed to the Bridge DPMs by 
IVO local ports. selecting the normal or filter versions of the unicast and 

4. Accessing debug information maintained by the Bridge multicast receive functions for the BDPM Inbound Receiver 
DPMs. 10 by calling the snLjegister_dpm_tag function with the 

Configuration and status information distribution occurs following tag values: 

as follows. 1, IBD_3DPM_RCV_UNICAST and IBD_BDFM_ 

Configuration and status information is distributed by die RCV_MCAST to select the normal receive functions 

Bridge DPM Server to the Bridge DPMs as part of the 2 . IBD JBDPM_RCV_UNICACT_J 3 ILTER and IBD_ 

Bridge DPM initialization process. The Bridge DPM Server 15 BDFM_RCV_>tCAST JILTER to select the filter ver- 

rcceives an ICM_PROCESS_UP message when each WO sions of ^ receive functions. 

card with a Bridge DPM initializes. The Bridge DPM Server ^ inbound ^ packets ( unicast and multicast) received 

also updates configuration /status information on the bridge ^ ^ are treated as exception packets and 

DPMs whenever changes occur during operation. passed to the central processor processing when the filter 

The following configuration and status information is ^ versioo 0 f the receive functions are active, 

distributed the Bridge DPMs: Slot status information is distributed as follows. 

1. port status: disabled, listening, learning, forwarding, The current slot state information is distributed to the 
blocking (WO local ports only). Bridge DPMs. The Bridge DPM Server is informed of the 

2. bridge control: Bridge or NoBridge, Forward or slot status by ICM_PROCBSS_UP and ICM_FROOBSS_ 
NoForward, Learn or NoLearn, et cetera. 25 DOWN messages. The Bridge DPMs cause the slot state 

3 Local MAC addresses. information to control flooding of mu l ti cas t and unknown 

/ a* ^ * - * ' unicast bridged data packets to other H/Os. 

4. bridge filter state informanon. A gcoc Jc%lot status function may be imr^emeiited od the 

5. flush entries in PAC. TJ/Os. It would directly inform the DPMs about the current 

6. per-pcrt broadcast limit information. 30 status of all slots in the system Slot status distribution by the 

7. current 1?*"* of all slots. Bridge DPM Server to the Bridge DPMs will not be neo 
Port status information is distributed as follows, essary if the generic function is inmlemented. 

The Bridge DPM Server is responsible for contributing Processing of PCQ requests firom bridge DPMs occurs as 

port state information to the Bridge DPMs and purging follows. 

obsolete information in the Bridge DPM PACs. Within the 35 The Bridge DPM queries the central bridge routing table 

central processor, Spaiining Tree Protocol port state transi- for the location of MAC addresses by sending the PCQ 

tions are reported by calling a m_control function. The messages to the Bridge DPM Server. Upon receiving a PCQ, 

tb__control function informs the Bridge DPM Server when- the Bridge DPM Server looks up the MAC addresses) 

ever a port state transition occurs for an WO port and specified in the PCQ in the central bridge routing table 

whenever an K)M port transitions to BLOCKING or DIS- 40 maintained on the central processor and posts a PCR in 

ABLED state, response to (he PCQ. If an address specified in the PCQ is 

When the Bridge DPM server receives a port state tran- found in the central bridge routing table, the destination port 

sitbn indication from tb_control for an WO port, the new number, WAN address if any, and media type is returned in 

port state is passed to the Bridge DPM far the WO port by mePCI^ Uponrccerviiig thePC updates 

issuing a CONFIG message. In addition, the Bridge Dm 45 the PACE and forwards the queued packers). 

Server issues a flush PAC CONFIG message to all Bridge ff an address contained in a PCQ is not found in the 

DPMs whenever a port transition to BLOCKING or DIS- central bridge routing table, Die port field in the PGR is set 

ABLED state. This ensures the Bridge DPM PACs are to UNKNOWN. The UNKNOWN is returned in the PCR, 

purged of obsolete ^formation, the Bridge DPM will flood the packet and delete the PACB 

Local MAC address distribution occurs as follows. 50 for the address. Deleting the PACE causes a new PCQ to be 

The Bridge DPM Server is responsible for distributing the posted by the Bridge DPM to the Bridge DPM Server when 

local MAC addresses to the Bridge DPMs. The local another packet for the same destination is received, 

addresses are distributed to each Bridge DPM in the system SA learning and age refresh in central bridge routing table 

as part of the Bridge DPM initialization process. is handled as follows. 

Additionally, updates are distributed to all Bridge DPMs 55 The Bridge Dm forwards learned MAC addresses to the 

whenever a local address is changed. An example of a local bridge DPM Server in PCU messages. A single PCU can 

MAC address when can change during operation is the contain several learned addresses. Each address entry in the 

special MAC address used by DECNET routing. PCU message contains the MAC address and source port 

The local MAC addresses are distributed by the Bridge number. 

DPM Server to the Bridge DPMs in CONFIG messages. 60 The Bridge DPM Server creates an entry in the central 

Multiple addresses can be packet into a single CONFIG bridge routing table for each learned MAC in the PCU 

message Each address entry in a CONFIG message contains message from the Bridge DPM. These entries are subject to 

the following information; the normal aging process. The Bridge DPMs periodically 

1. action: add or delete address posts PCU messages far active addresses in the PAC learned 

2. MAC address ~ 65 from the local L70 ports. This causes the Bridge DPM 
The Bridge DPM creates a PACE for each local MAC Server to refresh the age of the corresponding entries in the 

address in a received CONFIG message. These addresses are central bridge routing table. The interval for periodically 
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generating PCU messages from the DPMs is less than the 
time required to age out an entry in the central bridge routing 
table which prevents entries from being aged out 
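
By way of illustration only, the learning, query, and age refresh behavior of the central bridge routing table described above may be sketched as follows. The sketch is a simplified model and not the implementation of the described system: the class name, the method names, and the interval values (a 300 second aging time and a 60 second PCU interval) are assumptions chosen for the example, and WAN address and media type fields are omitted.

```python
from dataclasses import dataclass, field

AGE_OUT_SECONDS = 300.0      # assumed aging time of the central table
PCU_INTERVAL_SECONDS = 60.0  # assumed refresh interval, kept below the aging time
UNKNOWN = "UNKNOWN"          # port value returned in a PCR when the lookup fails


@dataclass
class CentralBridgeTable:
    """Toy stand-in for the central bridge routing table (hypothetical)."""
    entries: dict = field(default_factory=dict)  # mac -> (port, time of last refresh)

    def apply_pcu(self, pcu, now):
        """Learn or refresh every (mac, source_port) pair carried in one PCU."""
        for mac, port in pcu:
            self.entries[mac] = (port, now)

    def answer_pcq(self, macs):
        """Build a PCR: the port for each known address, UNKNOWN otherwise."""
        return [(mac, self.entries[mac][0] if mac in self.entries else UNKNOWN)
                for mac in macs]

    def age_out(self, now):
        """Remove and return addresses not refreshed within the aging time."""
        expired = [mac for mac, (_, seen) in self.entries.items()
                   if now - seen >= AGE_OUT_SECONDS]
        for mac in expired:
            del self.entries[mac]
        return expired


def run_periodic_refresh(table, active, duration):
    """Post one PCU per interval for the active addresses; return the last post time."""
    now = 0.0
    last_post = now
    while now <= duration:
        table.apply_pcu(active, now)
        table.age_out(now)
        last_post = now
        now += PCU_INTERVAL_SECONDS
    return last_post
```

Because the assumed PCU interval is shorter than the assumed aging time, an address that stays active is never removed by age_out; it is removed only after the periodic PCUs stop and a full aging time elapses.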
V. EXTENSION TO LAN OR WAN BACKBONE
FIG. 35 illustrates an extension of the present invention to
a system which replaces the high speed parallel bus of FIG.
1 with a local area network or wide area network backbone
generally 2000. For instance, the backbone 2000 might be an
ATM network coupled to a variety of local area networks
using virtual circuits, such as discussed in the document
published by the ATM Forum entitled LAN Emulation
Over ATM Specification, Version 1.0. Thus, a plurality of
input/output processors, such as IOP 2001, IOP 2002, and
IOP 2003 are coupled using the interprocessor messaging
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the backbone physical layer 2000, the IMS communicates 
among the IOPs using the message passing protocol as 
described above. Coupled to the WAN or LAN backbone 
2000, is at least one router in the embodiment shown. A first 
router labeled Router A 2007 is coupled to the backbone
through the interprocessor messaging system 2008. Also, a
second router 2009 labeled Router B in the figure, is coupled
to the backbone through the interprocessor messaging sys-
tem 2010. Each of the input/output processors 2001 through
2003 and the routers 2007 and 2009 in the figure include a
plurality of network connections which provide interfaces to
networks which use the router resources distributed amongst
the processors. More than one router is included in the
system. This way, the IOP processors 2001 through 2003 can
contain some fault tolerance. For instance, if Router A is
down, a processor may retry a given request to the router by 
sending it to Router B. A variety of protocols can be used to 
optimize performance of the system. For instance, the IOP 
might use Router A for a first transaction and Router B for 



network— the workgroup, the building or campus backbone, 
and the remote and personal offices connected over wide 
area links. It allows these segments to be administered from 
a centralized management system. The end result is an 
enterprise wide network that is suited to the way in which 
companies conduct business. The benefits of the high speed
scalable networking strategy include that expertise can be
brought together quickly to deliver projects or products
efficiently and effectively. Also, custom applications can be
developed more cost-effectively. The cost of incremen-
tal computing power drops dramatically because of the
scalable nature. Finally, the investment in current equipment 
and technology is protected while paving the way for future 
technologies. 

Thus, the scalable platform of the present invention 
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with a wide variety of input/output modules, including other 
full function engines, intelligent I/O modules which perform 
a subset of the routing decisions, and basic I/O modules 
which have no local routing capability and rely on the 
centralized full function routers for such decisions. All of 
these elements are interconnected by a high speed backplane 
bus utilized efficiently according to logical layer intercon-
nections for the intelligent I/O processors, and physical layer
interconnection for the basic I/O modules without process-
ing facilities necessary for managing the logical links. Thus,
the architecture of the present invention supports growing
complexity of I/O modules, as well as basic single port 
connections that can be used for incremental growth, and 
backward compatibility in the systems. 

The foregoing description of a preferred embodiment of 
the invention has been presented for purposes of illustration 
and description. It is not intended to be exhaustive or to limit
the invention to the precise forms disclosed. Obviously,
many modifications and variations will be apparent to prac-
titioners skilled in this art. It is intended that the scope of the
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Alternatively, each IOP could be assigned a primary router 
which it relies upon, unless a catastrophic failure in the 
primary router occurs. In which case, its requests are redi- 
rected to the secondary router. 
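
By way of illustration only, the primary and secondary router arrangement may be sketched as an ordered retry. The exception type, the function names, and the two stub routers below are hypothetical and stand in for requests carried over the interprocessor messaging system; the sketch assumes a router failure is visible to the IOP as an error.

```python
class RouterUnavailable(Exception):
    """Raised when a router cannot service a request (hypothetical)."""


def send_with_failover(request, routers):
    """Try each (name, handler) pair in order and return the first reply."""
    last_error = None
    for name, handler in routers:
        try:
            return name, handler(request)
        except RouterUnavailable as exc:
            last_error = exc  # remember the failure and try the next router
    raise RouterUnavailable("no router answered the request") from last_error


def router_a_down(request):
    """Stub for a primary router that has failed."""
    raise RouterUnavailable("Router A is down")


def router_b_up(request):
    """Stub for a secondary router that answers normally."""
    return "route for " + request
```

Listing the primary router first gives the behavior described above: the secondary router is consulted only when the primary router fails.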

Because of the interprocessor messaging system based on
the latency and reliability classes of the present invention,
the scalable internetworking processes are achieved using
the LAN or WAN backbone, which suffers lost packets from
time to time. Data in transit is ensured to receive the best
available throughput across the backbone 2000, while con-
trol messages and the like are given higher priority, and
managed to ensure greater reliability than are the high
throughput, data-in-transit messages. This way, the overhead
associated with high reliability type messages is not
extended to the data-in-transit, providing substantial
improvements in overall system throughput across the back-
bone 2000.
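
By way of illustration only, the different handling of control messages and data-in-transit may be sketched with two message classes. The class encoding, the BackboneSender name, and the acknowledgment scheme are assumptions made for the example and are not taken from the described messaging system: control messages are sent first and kept until acknowledged, while data-in-transit is sent once with no retransmission state.

```python
import heapq
from itertools import count

CONTROL, DATA = 0, 1  # assumed encoding: lower value is sent first


class BackboneSender:
    """Toy sender with a high priority reliable class and a best effort class."""

    def __init__(self):
        self._queue = []
        self._seq = count()
        self.unacked = {}  # seq -> payload, control messages only

    def post(self, msg_class, payload):
        """Queue a message and return its sequence number."""
        seq = next(self._seq)
        heapq.heappush(self._queue, (msg_class, seq, payload))
        return seq

    def transmit_next(self):
        """Send the highest priority message; remember it only if it is control."""
        msg_class, seq, payload = heapq.heappop(self._queue)
        if msg_class == CONTROL:
            self.unacked[seq] = payload  # kept for possible retransmission
        return msg_class, seq, payload

    def ack(self, seq):
        """Drop retransmission state for an acknowledged control message."""
        self.unacked.pop(seq, None)

    def retransmit_unacked(self):
        """Requeue every control message that has not been acknowledged."""
        for seq, payload in sorted(self.unacked.items()):
            heapq.heappush(self._queue, (CONTROL, seq, payload))
```

Only the control class carries acknowledgment and retransmission overhead, which mirrors the point made above that the cost of high reliability is not extended to data-in-transit.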

VI. CONCLUSION

Accordingly, the present invention provides a high per-
formance family of bridge/routers which supplies transpar-
ent communication between all types of interfaces within a
single chassis, integrating Token Ring, Ethernet, FDDI,
ATM, and WAN links. The architecture of the present
invention delivers the power of single or multiprocessor
options, with a high speed backplane bus for consistently 
fast throughput across all interface ports. 

These resources allow for selecting the most efficient path 
between any two locations, automatically re-routing around 
failures, solving broadcast and security problems, and estab-
lishing and administering organizational domains.

The high speed, scalable networking framework accord- 
ing to the present invention encompasses all segments of the 
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equivalents. 
What is claimed is: 

1. An apparatus for interconnecting a plurality of networks,
comprising: 

a plurality of input/output systems, having input/output 
ports for physical connections to a diversity of net- 
works operating with a plurality of routed network 
protocols, said input/output systems having a plurality 
of variant sets of processing resources; 
an interprocessor messaging system, coupled with the
plurality of input/output systems, including a logical
layer and a physical layer, for transferring data-in-
transit and control signals among the plurality of input/
output systems; and
distributed processing services in the plurality of input/
output systems, including for respective sets of routing
decisions according to corresponding routed network 
protocols in the plurality of routed network protocols a 
central routing resource in a processor coupled to the 
interprocessor messaging system, and a distributed 
protocol module in a given input/output system in the 
plurality of input/output systems, in which the distrib-
uted protocol module supports a subset of the respec-
tive set of routing decisions for the corresponding
routed network protocol and relies on communications 
across the interprocessor messaging system with the 
central routing resource for routing decisions not in the 
subset of the set of routing decisions for the corre- 
sponding routed network protocol. 
2. The apparatus of claim 1, wherein the processor which 
includes the central routing resource, includes a plurality of 
input/output ports for physical connections to a networks. 
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3. The apparatus of claim 1, wherein the plurality of 
input/output systems comprise a plurality of central routing 
processors coupled to the interprocessor messaging system, 
each including central routing resources for respective 
routed network protocols.

4. The apparatus of claim 1, wherein the central routing 
resource includes resources for communicating with input/
output systems at the physical layer of the interprocessor
messaging system for input/output systems without logical
layer processing capability.

5. The apparatus of claim 1, wherein the interprocessor 
messaging system supports messages among the plurality of 
input/output systems according to a plurality of classes 
having different latency and reliability characteristics. 

6. The apparatus of claim 1, wherein the central routing 
resource includes a routing table for the given protocol, and
the distributed protocol module includes a routing table
cache maintained through the interprocessor messaging sys-
tem.

7. The apparatus of claim 1, including a central routing 
processor coupled to the interprocessor messaging system
which includes:

the central routing resource;

resources for communicating with input/output systems at
the physical layer of the interprocessor messaging
system for input/output systems without logical layer
processing capability; and

wherein the interprocessor messaging system supports
messages among the plurality of input/output systems
and the central routing processor according to a plu-
rality of classes having different latency and reliability
characteristics. 

8. The apparatus of claim 7, wherein the central routing 
resource includes a routing table for at least one routed
network protocol in the plurality of routed network
protocols, and the distributed protocol module for the at least
one routed network protocol includes a routing table cache
maintained through the interprocessor messaging system.

9. The apparatus of claim 1, wherein the interprocessor
communication system includes a backbone communication
medium which comprises a local area network.

10. The apparatus of claim 1, wherein the interprocessor
communication system includes a backbone communication
medium which comprises a wide area network.

11. The apparatus of claim 1, wherein the interprocessor
communication system includes a backbone communication
medium which comprises an asynchronous transfer mode
network executing a process for emulation of a connection-
less local area network protocol.

12. An apparatus for interconnecting a plurality of net-
works through network interface systems having different
degrees of protocol processing capability, comprising:
a router processor having processing resources for man-
aging multiprotocol routing of packets received from
the plurality of networks, including protocol processing
resources serving the different degrees of protocol
processing capability in the network interface systems;
a bus having a plurality of bus slots for network interface
systems and coupled to the router processor providing
a data path among the router processor and network
interface systems connected in the bus slots;
a bus communication system run in the router processor
and network interface systems in the bus slots support-
ing flow of data-in-transit and control messages among
the router processor and the bus slots across the bus;
wherein the protocol processing resources provide cen-
tralized protocol processing for packets forwarded
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from a particular network interface system which 
receives a packet needing processing for a protocol not 
supported by the particular network interface system, 
and provide distributed protocol processing for proto- 
cols partially supported by network interface systems in 
response to requests from network interface systems. 

13. The apparatus of claim 12, wherein the centralized 
protocol processing includes support for per-packet routing 
and header processing for at least one protocol run in the 
plurality of networks. 

14. The apparatus of claim 12, wherein the distributed
protocol processing includes management of routing tables 
in the router processor and support of routing table caches in 
network interface systems for at least one protocol run in the 
plurality of networks. 

15. The apparatus of claim 12, wherein the communica- 
tions system supports transfer of data-in-transit and control 
messages among the router processor and the plurality of 
bus slots with reliability classes. 

16. The apparatus of claim 12, wherein the communica- 
tions system supports transfer of data-in-transit and control 
messages among the router processor and the plurality of 
bus slots with latency classes. 

17. The apparatus of claim 12, wherein the router pro- 
cessor includes shared memory resources accessible through 
the bus communication system by the network interface 
systems in the plurality of slots, for holding data-in-transit.

18. An apparatus for interconnecting a plurality of 
networks, comprising: 

a backbone comrourncation medium having a physical 

layer protocol; 
a central routing processor coupled to the backbone, 
including resources for making routing decisions 
according to a plurality of routed network protocols; 
a plurality of input/output modules coupled to the back- 
bone and in communication with the central routing 
processor according to the physical layer protocol, the 
lnput/ou^)ut modules having respective sets of physical 
network interfaces, the set for a given inpinVoutpvt 
module having one or more members; 
an wtexprocessor messaging system in a logical layer 
above the physical layer protocol executed in the 
central routing processor and in a set of one or more 
intelligent inrxityoutput modules within the plurality of 
input/output modules; and 
distributed protocol services executed over the interpro- 
cessor messaging system, including a distributed pro- 
tocol module in at least one lacmber of the set of 
intelligent input/output devices which makes routing 
decisions supported by the distributed protocol module 
according to a corresponding routed network protocol 
in the plurality of routed network protocols, and a 
distributed protocol module server in the central rout- 
ing processor which in response to queries from the 
distributed protocol module makes routing decisions on 
behalf of the distributed protocol module according to 
the corresponding routed network protocol. 

19. The apparatus of claim 18, wherein a particular 
input/output module in the plurality coupled to the backbone 
includes resources for signalling the central routing proces- 
sor about events across the physical layer protocol, and 
including: 

centralized routing services executed in the central rout- 
ing processor over the physical layer protocol in 
response to events on the particular input/output mod- 
ule which makes routing decisions. 
20. The apparatus of claim 18, wherein the distributed 
protocol services include resources for performing 
transparent bridging in at least one of the input/output 
modules. 

21. The apparatus of claim 18, wherein the corresponding 
routed network protocol of the distributed protocol services 
comprises Internet protocol (IP) routing in at least one of the 
input/output modules. 

22. The apparatus of claim 18, wherein the interprocessor 
messaging system includes resources for transferring control 
messages and network packets-in-transit among the central 
routing processor and input/output modules in the set of 
intelligent input/output modules. 

23. The apparatus of claim 18, wherein the distributed 
protocol module includes a protocol routing table cache, and 
the distributed protocol module server includes resources for 
maintaining a central protocol routing table and supporting 
the protocol routing table caches. 

24. The apparatus of claim 18, wherein the interprocessor 
messaging system supports communication among the 
central routing processor and the plurality of input/output 
modules with messages in a plurality of latency classes. 

25. The apparatus of claim 18, wherein the physical layer 
protocol supports communication among the central routing 
processor and the plurality of input/output modules with 
messages in a plurality of latency classes. 

26. The apparatus of claim 18, wherein the interprocessor 
messaging system supports communication among the 
central routing processor and the plurality of input/output 
modules with messages in a plurality of reliability classes. 

27. The apparatus of claim 18, wherein the physical layer 
protocol supports communication among the central routing 
processor and the plurality of input/output modules with 
messages in a plurality of reliability classes. 

28. The apparatus of claim 18, including a second central 
routing processor [illegible] second protocol module server. 

29. The apparatus of claim 18, wherein the backbone 
communication medium comprises a local area network. 

30. The apparatus of claim 18, wherein the backbone 
communication medium comprises a wide area network. 

31. The apparatus of claim 18, wherein the backbone 
communication medium comprises an asynchronous 
transfer mode network executing a process for emulation of a 
connectionless local area network protocol. 

32. The apparatus of claim 18, wherein the backbone 
communication medium comprises a high speed parallel 
bus. 

***** 
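
By way of illustration only, and without limiting the claims, the relationship recited in claims 18 and 23 between a distributed protocol module holding a routing table cache and a distributed protocol module server holding the central routing table can be sketched as follows. This is a minimal sketch under stated assumptions: the class names, the `resolve` and `route` methods, the dictionary-based tables, and the slot/port strings are all hypothetical and do not appear in the specification, and a direct method call stands in for a query carried over the interprocessor messaging system.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProtocolModuleServer:
    """Hypothetical stand-in for the server in the central routing processor."""
    routing_table: Dict[str, str]  # destination -> outgoing interface (assumed layout)
    queries: int = 0               # counts queries received from distributed modules

    def resolve(self, destination: str) -> Optional[str]:
        # Make the routing decision on behalf of the querying module.
        self.queries += 1
        return self.routing_table.get(destination)


@dataclass
class DistributedProtocolModule:
    """Hypothetical stand-in for the module on an intelligent input/output module."""
    server: ProtocolModuleServer
    cache: Dict[str, str] = field(default_factory=dict)

    def route(self, destination: str) -> Optional[str]:
        # Decide locally when the routing table cache already holds an entry.
        hit = self.cache.get(destination)
        if hit is not None:
            return hit
        # Otherwise query the server (a plain call here, a message in the apparatus).
        answer = self.server.resolve(destination)
        if answer is not None:
            self.cache[destination] = answer  # remember the answer for later packets
        return answer
```

In this sketch, a first call to `route` for a given destination reaches the server, and a repeated call for the same destination is answered from the cache, so the server's `queries` counter does not increase. Cache invalidation and table updates pushed by the server are omitted.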